PREFACE


The 1973 Sagamore Computer Conference on Parallel Processing was held on August 22-24, 1973 at the 
former Vanderbilt summer estate in the Central Adirondack Mountains. The Conference was conceived to
give the participants a secluded environment, a 1300-acre preserve surrounding the private Sagamore
Lake, with excellent opportunities for exchanging ideas and learning about each other's research
activities. Thus, informative discussions could take place not only during the technical sessions but
also throughout the various sports and social gatherings provided by the Conference.

The enthusiastic cooperation and response that we received throughout the Conference and during 
its preparation was indeed most heartening. We not only received many more papers than we could possibly 
schedule, but also the number of requests to attend exceeded the Sagamore accommodations. Thus, there
seems to be a popular demand for such a conference in parallel processing. Another conference is being 
scheduled for the next year — August 21-23, 1974. 

The success of such a conference requires the vigorous support of many individuals. In this respect, 
we are most grateful to all the authors who submitted their papers for consideration. It is our deep 
regret that not all qualified papers could be scheduled for the Conference. We are also much indebted 
to all the reviewers who, in order to meet the stringent review deadlines, put aside their own busy work
schedules to carefully evaluate the papers sent for their judgement. Their valuable comments not only
resulted in a set of high-quality papers for the Conference, but also were sincerely appreciated by many
authors. The generous help we received from the session chairmen also contributed much to the success
of the Conference. In addition, we wish to acknowledge the excellent cooperation provided to us by
IEEE, IEEE Computer Society, ACM, their local chapter chairmen, as well as the staff of various technical
magazines. In particular, we are indebted to Mr. James J. Andover, Mr. Charles Casale, Dr. W. Smith
Dorsey, Prof. Michael J. Flynn, Prof. Caxton C. Foster, Mrs. Irene Hollister, Mr. David Jacobsohn, Mr.
John L. Kirkley, Mr. E. D. MacDonald, Prof. Harold S. Stone, and many others for their assistance in
achieving such cooperation. Special thanks are also due to members of various committees. Their time
and effort devoted to the Conference are indeed invaluable.


Tse-yun Feng
Department of Electrical & Computer Engineering
Syracuse University


THE COORDINATE METHOD FOR THE
PARALLEL EXECUTION OF DO LOOPS 


Leslie Lamport 


Massachusetts Computer Associates 
Wakefield, Massachusetts 01880 


Abstract -- An algorithm is presented 
which translates a program with nested sequential 
DO loops into one suitable for execution on a par- 
allel array or vector computer. If necessary, 
extensive rearrangement of the program's structure 
is made. 


Introduction 


We consider the problem of compiling ordi- 
nary sequential programs for execution on a par- 
allel array or vector computer such as the Illiac IV 
or the CDC Star-100. This problem is of practical 
importance for the following reasons: 

(1) There exist sequential programs which 
one would like to run on these parallel computers. 

(2) If a program is to be run on two different
machines, it might be best to write it in
sequential form and let each compiler find the most 
efficient parallel execution for its computer. 

(3) A compiler may be able to find more
parallelism in a program than the programmer can.
(See [1].)

The methods which we introduce should
also be useful in other areas of program optimization.

We consider a FORTRAN program containing 
DO loops, and describe a method of translating it 
into an extended FORTRAN program in which one or
more of the DO loops is executed in parallel.

This is an obvious approach, and has been used in 
[2] - [4]. The method presented here generalizes 
the coordinate method of [4], and is more general 
than the analogous methods of [2] and [3]. Al- 
though our exposition is self-contained, it is best 
to read [4] first. 

We specify parallel execution with a 
DO SIM statement of the following form: 

      DO 99 SIM FOR ALL I ∈ S,

where S is a set of integers. The statements in
its range are executed one after another as usual.
However, each statement is executed simultaneously
for all of the indicated values of I. An
assignment statement is executed by first computing
the right-hand side for each value of I,
then simultaneously performing the assignments.
Thus, the statement

      A(I) = A(I-1) + B(I)

would simultaneously set A(i) equal to the original
value of A(i-1) plus B(i), for each value
of i in S.
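The difference between a DO SIM assignment and an ordinary sequential DO loop can be seen in a small Python sketch. The function names and the sample arrays are illustrative assumptions, not part of the paper; lists are indexed from 0 here, so the index set is chosen to keep I-1 in range.

```python
def do_sim_assign(a, b, index_set):
    """A(I) = A(I-1) + B(I) executed simultaneously for all I in index_set."""
    # First compute every right-hand side from the original values of A ...
    rhs = {i: a[i - 1] + b[i] for i in index_set}
    # ... then perform all the assignments.
    result = list(a)
    for i, value in rhs.items():
        result[i] = value
    return result


def do_seq_assign(a, b, index_set):
    """The same statement inside an ordinary sequential DO loop."""
    result = list(a)
    for i in sorted(index_set):
        # Each iteration sees the value written by the previous iteration.
        result[i] = result[i - 1] + b[i]
    return result


a = [1, 2, 3, 4]
b = [10, 10, 10, 10]
print(do_sim_assign(a, b, {1, 2, 3}))  # [1, 11, 12, 13]
print(do_seq_assign(a, b, {1, 2, 3}))  # [1, 11, 21, 31]
```

The two results differ because the sequential loop carries a value from one iteration to the next, which is exactly the kind of dependence the coordinate method has to detect before a DO loop can become a DO SIM loop.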


The coordinate method tries to change DO 
loops to DO SIM loops. We show that it suffices 
to consider one DO loop at a time. The basic
method is the coordinate algorithm, which we il- 
lustrate by an example. Suppose we are given 
the following program. 


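Program 1:

      DO 100 I = 2, P
                    [p]
      DO 10 J = 1, I
   10 [statement not legible in this copy; by the later
      S4 example it has the form a1 = a2 + u1**2]
      IF (B(I) .LT. 0) GO TO 25
          [b1]
      DO 20 K = 2, Q(I)
                   [q]
      U(I,K) = 2 * U(I,K-1)
      [u2]         [u3]
      B(I) = B(I) + (U(I-1,K) - U(I+1,K)) ** 2
      [b2]   [b3]    [u4]       [u5]
   20 C(K) = C(K) + K * U(I,K)
      [c1]   [c2]       [u6]
   25 D    = D    + B(I)
      [d1]   [d2]   [b4]
      DO 30 SIM FOR ALL L ∈ {2,..., 100}
      DO 30 M = 4, 50
   30 E(I,L,M) = E(I-1,L+1,M-3) * B(I+1)
      [e1]       [e2]             [b5]
  100 CONTINUE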


This is a nonsensical program, but it will 
serve to illustrate most details of the algorithm. 
Each occurrence of a non-index variable is given 
a name, which appears in a box beneath it. Loop 
bodies are boxed and labeled for legibility. The 
L loop might have been changed from a DO to a
DO SIM loop by a previous application of the algorithm.

The coordinate algorithm can translate 
Program 1 into the following equivalent program. 


Program 2:

      TMP1 = P
      DO 901 I1 = 2, TMP1
      DO 10 J = 1, I1
   10 [statement 10 of Program 1 with I replaced by I1;
      not legible in this copy]
  901 CONTINUE
      DO 902 SIM FOR ALL I2 ∈ {i : 2 ≤ i ≤ TMP1}
      TMP2(I2) = .NOT. (B(I2) .LT. 0)
      IF (TMP2(I2)) TMP3(I2) = Q(I2)
      DO 30 SIM FOR ALL L ∈ {2,..., 100}
      DO 30 M = 4, 50
   30 E(I2,L,M) = E(I2-1,L+1,M-3) * B(I2+1)
  902 CONTINUE
      DO 920 K = 2, MAXIMUM({TMP3(i) :
             2 ≤ i ≤ TMP1 .AND. TMP2(i)})
      DO 904 SIM FOR ALL I4 ∈ {i : K ≤ TMP3(i)
             .AND. 2 ≤ i ≤ TMP1 .AND. TMP2(i)}
      TMP4(I4) = U(I4+1,K)
      U(I4,K) = 2 * U(I4,K-1)
      B(I4) = B(I4) + (U(I4-1,K) - TMP4(I4)) ** 2
      TMP5(I4) = K * U(I4,K)
  904 CONTINUE
      DO 905 I5 = 2, TMP1
      IF (TMP2(I5) .AND. K ≤ TMP3(I5))
             C(K) = C(K) + TMP5(I5)
  905 CONTINUE
  920 CONTINUE
      DO 903 I3 = 2, TMP1
      IF (TMP2(I3)) D = D + B(I3)
  903 CONTINUE

(a) We need only assume that we know which
data can be modified by a subroutine or function
call, but this would complicate matters.


Observe that the I loop of Program 1 has
been split into the five $I_1, \ldots, I_5$ loops. Two
of these are DO SIM loops, so Program 2 has
more parallel execution than Program 1. Note the 
extensive rearrangement of Program 1 needed to 
achieve this parallelism. The L/M loop has 
been moved before the K loop; statement 20 has 
been split into two parts which appear inside dif- 
ferent loops; the u5 occurrence has been moved; 
etc. Of course, this example is contrived to 
demonstrate the power of the algorithm. 


In general, we consider an extended 
FORTRAN program containing DO and DO SIM 
loops, with the following restrictions. 

1. There is no backward transfer of con- 
trol other than that implied by the DO loops. 
Thus, if all DO and DO SIM statements were re- 
moved, then the resulting program would have no 
loops. (Techniques for translating programmed 
loops into DO loops are described in [3].) 

2. There is no I/O statement. We assume 
that input/output is done with the initial/final 
values of variables. 

3. The increment of every DO loop is a 
constant which is known at compile time. 

4. There is no transfer of control from in-
side the range of a DO or DO SIM loop to outside 
its range - i.e., no premature exits from loops. 

5. There is no subroutine call, and no
function call which can change the value of a
variable. The value of a function must depend
only on the values of its arguments. (a)

The program which we consider here may 
be any portion of an actual FORTRAN program 
having a single entry point. In particular, it may 
consist of a single DO loop. Hence, these re- 
strictions are reasonable. 


Space limitations require that we eliminate
many details, including the proofs of theorems.
They will appear in [5].


Representation of the Program 


For our analysis, we need a way of repre- 
senting a program which is more convenient than 
the original FORTRAN representation. To simplify 
the exposition, we assume that all DO loop incre- 
ments equal 1. The generalization to arbitrary 
increments is described later. 


The Program Tree 


The first part of our representation is the 
program tree, which describes a program's nested 
loop structure. The terminal nodes of the tree 
represent occurrences of variables. (Occurrences 
of DO and DO SIM index variables are excluded.) 
The non-terminal nodes represent the DO and DO
SIM loop bodies. A dummy node, labeled o, is
placed at the top of the tree.

The program trees of Programs 1 and 2 are 
shown in Figures 1 and 2, respectively. (For 
Program 2, we have excluded occurrences of 
TMP1, ..., TMP5 from the tree.) Occurrence 
nodes are denoted by boxes. Loop nodes are de- 
noted by circles and labeled by the index variable 
name. (We assume that each loop has a unique 
index variable.) DO SIM nodes are distinguished 
by concentric circles. 


We use paternity relations to describe tree
structure. In Figure 1, the J node is the father
of the u1 node and the son of the I node. The
o node is an ancestor of all other nodes.

We let $\mathcal{N}(\mathcal{T})$ denote the set of all nodes of
a tree $\mathcal{T}$, and $\mathcal{O}(\mathcal{T})$ denote the set of all
terminal nodes. If $\alpha$ is any node of a tree, then
$\mathcal{T}(\alpha)$ denotes the subtree headed by $\alpha$. We let
$\mathcal{N}(\alpha)$ and $\mathcal{O}(\alpha)$ denote $\mathcal{N}[\mathcal{T}(\alpha)]$ and
$\mathcal{O}[\mathcal{T}(\alpha)]$, respectively. In Figure 2,
$\mathcal{O}(I_2) = \{b1, q, e1, e2, b5\}$.

A non-empty sequence of nodes $\alpha_1, \ldots, \alpha_n$
is called a branch of a tree if o is the father of
$\alpha_1$ and each $\alpha_i$ is the father of $\alpha_{i+1}$.
Three branches of Figure 1 are: (1) I, K; (2) I, J, a2;
and (3) p.

If $\alpha$ and $\beta$ are two nodes of a tree, we
let $\alpha \cap \beta$ denote their most recent common
ancestor. In Figure 1, we have $a1 \cap a2 = J$,
$u3 \cap J = I$ and $p \cap q = o$. We define $\alpha \cap \alpha$
to be the father of $\alpha$.
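The most recent common ancestor can be computed directly from father links. The sketch below is only an illustration: the dictionary FATHER encodes part of the tree of Program 1 as assumed from the loop nesting described in the text (Figure 1 is not reproduced here), and the names FATHER, proper_ancestors and meet are hypothetical. It works with proper ancestors only, so that the meet of a node with itself comes out as its father; how the paper treats a pair in which one node is an ancestor of the other is not stated, and that case is not exercised.

```python
# Father links for part of the program tree of Program 1 (an assumption
# based on the text; "o" is the dummy root node).
FATHER = {
    "p": "o", "I": "o",
    "J": "I", "K": "I", "q": "I", "b1": "I",
    "a1": "J", "a2": "J", "u1": "J",
    "u2": "K", "u3": "K",
}


def proper_ancestors(node):
    """Ancestors of node, nearest first, excluding node itself."""
    chain = []
    while node in FATHER:
        node = FATHER[node]
        chain.append(node)
    return chain


def meet(a, b):
    """Most recent common (proper) ancestor; meet(a, a) is the father of a."""
    others = set(proper_ancestors(b))
    for candidate in proper_ancestors(a):
        if candidate in others:
            return candidate
    raise ValueError("no common ancestor")


print(meet("a1", "a2"), meet("u3", "J"), meet("p", "q"), meet("a1", "a1"))
```

The printed values correspond to the examples given in the text: J, I, o, and the father J of a1.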


Let f and g be occurrences in a program.
We say that f precedes g if there is a flow path 
from f to g in which each DO loop is executed 
at most once. In Program 1, u3 precedes u2, u5, 
e2, etc. By restriction 1, if f precedes g then 
g cannot precede f. 

The motive for the following definition
comes from considering $f \to g$ to mean that the
occurrence f must precede the occurrence g.
Then $\to^*$ contains precedence relations on the loop
nodes implied by $\to$.

Definition 1: Let $\mathcal{T}$ be a tree and let $\to$ be any
relation on $\mathcal{O}(\mathcal{T})$. The tree completion of $\to$ is
the smallest relation $\to^*$ on $\mathcal{N}(\mathcal{T})$ which satisfies
the following conditions:

(1) If $f \to g$ then $f \to^* g$;

(2) If $\alpha \to^* \beta$, $\beta \to^* \gamma$ and $\alpha$ is neither
an ancestor nor a descendant of $\gamma$, then $\alpha \to^* \gamma$.

(3) If $\alpha \to^* \beta$, $\alpha \neq \beta$, $\alpha'$ is either the father
of $\alpha$ or else $\alpha' = \alpha$, $\beta'$ is either the father of
$\beta$ or else $\beta' = \beta$, and $\alpha' \neq \beta'$; then $\alpha' \to^* \beta'$.

The relation $\to$ is said to be tree inconsistent
if $\alpha \to^* \alpha$ for some node $\alpha$. Otherwise,
$\to$ is said to be tree consistent.

[Figure 1: program tree of Program 1]

A partition of a set is a collection of pairwise
disjoint subsets whose union equals the
whole set. Let $P = \{S_1, \ldots, S_n\}$ be a partition
of a set S, and let $\to$ be a relation on S. The
relation $\to$ induced on P by $\to$ is defined by
$S_i \to S_j$ if and only if $S_i \neq S_j$ and there exist
$s \in S_i$ and $t \in S_j$ such that $s \to t$.

A tree partition P of a tree $\mathcal{T}$ is a partition
$\{N_1, \ldots, N_n\}$ of $\mathcal{N}(\mathcal{T})$ satisfying the
following property: If $\alpha \in N_i$, $\beta$ and $\gamma \in N_j$,
$N_i \neq N_j$ and $\alpha$ is the father of $\beta$, then the father
of $\gamma$ is contained in either $N_i$ or $N_j$. We give
P a tree structure by letting the father/son relation
be the one induced on P by the father/son
relation of $\mathcal{T}$.

As an example, let $\mathcal{T}$ be the subtree $\mathcal{T}(I_2)$
of Figure 2. Then $\{b1\}$, $\{I_2, q, L\}$, $\{M, e1, b5\}$,
$\{e2\}$ is a tree partition of $\mathcal{T}$. Its tree
structure is shown in Figure 3.

[Figure 3: tree structure of the example tree partition]

Index Sets

Let $Z^n$ denote the set of all n-tuples of
integers, with the usual operations of addition
and subtraction, and let $0 = (0, 0, \ldots, 0)$. We
define $Z^0 = \{0\}$.

Let $\alpha$ be a loop node of a program tree
and let $I^1, \ldots, I^n$ be the branch with $I^n = \alpha$.
Then $|\alpha|$ is defined to equal n. We define
$Z_\alpha$ to be the set $Z^{|\alpha|}$.

The relations $\sim$ and $<$ on $Z_\alpha$ are defined
as follows. Let $I^{j_1}, \ldots, I^{j_k}$ be the DO
nodes among the $I^j$ (the remaining $I^j$ being DO
SIM nodes), with $j_1 < \cdots < j_k$. For any elements
$(a^1, \ldots, a^n)$ and $(b^1, \ldots, b^n)$ of $Z_\alpha$,
we let

(1) $(a^1, \ldots, a^n) \sim (b^1, \ldots, b^n)$ if
$(a^{j_1}, \ldots, a^{j_k}) = (b^{j_1}, \ldots, b^{j_k})$;

(2) $(a^1, \ldots, a^n) < (b^1, \ldots, b^n)$ if
$(a^{j_1}, \ldots, a^{j_k})$ is lexicographically smaller than
$(b^{j_1}, \ldots, b^{j_k})$, reading the components from left
to right.

In Figure 1, we have $|M| = 3$, $Z_M = Z^3$,
and $(3, 7, 2) \sim (3, 10, 2) < (3, -1, 4)$.
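The two relations on index tuples depend only on the components that belong to DO nodes. A minimal Python sketch, assuming the branch is described by a tuple of flags (True for a DO node, False for a DO SIM node); the function names are illustrative and not taken from the paper:

```python
def do_components(tup, is_do):
    """Keep only the components of tup that belong to DO nodes."""
    return tuple(x for x, flag in zip(tup, is_do) if flag)


def simultaneous(a, b, is_do):
    """The relation a ~ b: equal on every DO component."""
    return do_components(a, is_do) == do_components(b, is_do)


def earlier(a, b, is_do):
    """The relation a < b: lexicographically smaller on the DO components."""
    # Python compares tuples lexicographically, left to right.
    return do_components(a, is_do) < do_components(b, is_do)


# Branch I, L, M of Program 1: I and M are DO nodes, L is a DO SIM node.
M_BRANCH = (True, False, True)
print(simultaneous((3, 7, 2), (3, 10, 2), M_BRANCH))  # True
print(earlier((3, 10, 2), (3, -1, 4), M_BRANCH))      # True
```

The DO SIM coordinate L is ignored entirely, which is why (3, 7, 2) and (3, 10, 2) count as simultaneous executions.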
The element $A = (a^1, \ldots, a^n)$ of $Z_\alpha$
represents a possible execution of the body of the
$\alpha$ loop for $I^1 = a^1, \ldots, I^n = a^n$. For any
element B of $Z_\alpha$, $A \sim B$ if the executions of the
$\alpha$ loop body for A and B occur simultaneously,
and $A < B$ if the execution for A precedes the
execution for B. Note that this defines the
meaning of a DO loop inside a DO SIM loop. (It
is not the meaning one might expect if the lower
DO limit depends upon the DO SIM index variable.)

We define $|o| = 0$, and let the relation
$\sim$ on $Z_o = \{0\}$ be defined by $0 \sim 0$. If f is
an occurrence node whose father is $\alpha$, then we
let $|f| = |\alpha|$ and $Z_f = Z_\alpha$. An element of
$Z_f$ represents a possible execution of the occurrence f.

Now let the $I^j$ be as above and let $\beta = I^k$,
$k \le n$. We define the projection mapping
$\pi^\alpha_\beta : Z_\alpha \to Z_\beta$ by
$\pi^\alpha_\beta(a^1, \ldots, a^n) = (a^1, \ldots, a^k)$.
If an occurrence node f is the son of $\alpha$,
then we let $\pi^f_\beta = \pi^\alpha_\beta$. In Figure 1,
$\pi^M_I : Z^3 \to Z^1$ is defined by $\pi^M_I(i, l, m) = (i)$.


The reader can verify the following fact.

Proposition 2: Let f and g be occurrences,
$P \in Z_f$ and $Q \in Z_g$. The execution of f for
P precedes the execution of g for Q if either

(i) $\pi^f_{f \cap g}(P) < \pi^g_{f \cap g}(Q)$, or

(ii) $\pi^f_{f \cap g}(P) \sim \pi^g_{f \cap g}(Q)$ and f precedes g.
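Proposition 2 can be read as a small decision procedure. The sketch below assumes that the caller supplies the DO/DO SIM flags of the branch ending at the most recent common ancestor of f and g, together with a Boolean saying whether f precedes g in the program text; the function names and calling convention are illustrative assumptions.

```python
def do_components(tup, is_do):
    """Keep only the components of tup that belong to DO nodes."""
    return tuple(x for x, flag in zip(tup, is_do) if flag)


def executes_before(p, q, common_is_do, f_precedes_g):
    """Sufficient condition of Proposition 2.

    p, q          index tuples of the occurrences f and g
    common_is_do  DO flags of the branch ending at the common ancestor
    f_precedes_g  True if f precedes g in the program text
    """
    k = len(common_is_do)
    # Project both tuples onto the coordinates of the common ancestor.
    proj_p = do_components(p[:k], common_is_do)
    proj_q = do_components(q[:k], common_is_do)
    if proj_p < proj_q:                          # case (i)
        return True
    return proj_p == proj_q and f_precedes_g     # case (ii)


# u3 and u2 of Program 1 share the branch I, K (both DO nodes).
print(executes_before((2, 5), (2, 6), (True, True), True))  # True, case (i)
print(executes_before((2, 5), (2, 5), (True, True), True))  # True, case (ii)
# u2 and e2 share only the I node; u2 comes first in the text.
print(executes_before((3, 4), (3, 7, 9), (True,), True))    # True, case (ii)
```

A False result only means that the sufficient condition of the proposition is not met for that ordered pair.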


For any node $\alpha$, we define the index set
$\mathcal{J}_\alpha$ to be the subset of $Z_\alpha$ consisting of those
elements for which $\alpha$ is actually executed. In
Program 1, we have:

   $\mathcal{J}_{a1} = \mathcal{J}_J = \{(i,j) : 2 \le i \le P \text{ and } 1 \le j \le i\}$
   $\mathcal{J}_q = \{(i) : 2 \le i \le P \text{ and } B(i) \ge 0\}$
   $\mathcal{J}_{u2} = \mathcal{J}_K = \{(i,k) : 2 \le i \le P,\ 2 \le k \le Q(i) \text{ and } B(i) \ge 0\}$.

Note that in general, $\mathcal{J}_\alpha$ may depend upon the
initial values of variables, and often will not be
known at compile time.


Occurrences 


An occurrence of a variable is called a 
generation if it appears on the left-hand side of an 
assignment statement, otherwise it is called a 
use. A relevant occurrence pair is an ordered pair 
of occurrences of a single variable, at least one 
of which is a generation. In Program 1, there are
three relevant pairs of occurrences of the variable
E: (1) e1, e2; (2) e2, e1; and (3) e1, e1.


Execution of the occurrence b5 of Program
1 for an element $(i, l, m)$ in $\mathcal{J}_{b5}$ references
the (i+1) element of the array B. This defines
the occurrence mapping $T_{b5} : \mathcal{J}_{b5} \to Z^1$ given by
$T_{b5}(i, l, m) = (i+1)$. In general, let f be an
occurrence of a k-dimensional array variable. (A
scalar is considered to be a 0-dimensional array.)
Then $T_f : \mathcal{J}_f \to Z^k$. The mapping $T_f$ may not be
known at compile time. (b)


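Definition 3: Let f, g be a relevant occurrence
pair. We define $\langle\langle f, g \rangle\rangle$ to be the set

   $\{X \in Z_{f \cap g}$ : there exist $P \in \mathcal{J}_f$ and $Q \in \mathcal{J}_g$ such
   that $T_f(P) = T_g(Q)$ and $X = \pi^g_{f \cap g}(Q) - \pi^f_{f \cap g}(P)\}$.

We define $\langle f, g \rangle$ to be some fixed subset
of $Z_{f \cap g}$, known at compile time, which
contains $\langle\langle f, g \rangle\rangle$.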


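An element X of $\langle\langle f, g \rangle\rangle$ implies the
existence of elements $P \in \mathcal{J}_f$ and $Q \in \mathcal{J}_g$ such
that the executions of f for P and g for Q
reference the same array element. Since $A < B$ if
$B - A > 0$, Proposition 2 implies that the reference
by f precedes the reference by g if either
(i) $X > 0$ or (ii) $X \sim 0$ and f precedes g.
Some $\langle\langle f, g \rangle\rangle$ sets for Program 1 are:

   $\langle\langle e1, e2 \rangle\rangle = \{(1, -1, 3)\}$
   $\langle\langle u1, u2 \rangle\rangle = \{(1)\}$ if [index set not legible in this copy] $\neq \phi$ (c)
   $\langle\langle b2, b3 \rangle\rangle = \{(0, k) : 2 - Q(i) \le k \le Q(i) - 2 \text{ for some } i \in \mathcal{J}_q\}$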

The set $\langle f, g \rangle$ is the best "upper bound"
on the set $\langle\langle f, g \rangle\rangle$ which the compiler can
find. Computing these sets is a major implementation
problem which we will not discuss. We
assume that the compiler finds the following
$\langle f, g \rangle$ sets for Program 1.

   <a1, a1> = <a1, a2> = <a2, a1> = $Z^2$
   <b1, b2> = <b2, b1> = <b2, b4> = <b4, b2> = {(0)}
   <b2, b2> = <b2, b3> = <b3, b2> = {(0, k) : k any integer}
   <b2, b5> = {(-1)}
   <b5, b2> = {(1)}
   <c1, c1> = <c1, c2> = <c2, c1> = {(i, 0) : i any integer}
   <d1, d1> = <d1, d2> = <d2, d1> = $Z^1$
   <e1, e1> = {(0, 0, 0)}
   <e1, e2> = {(1, -1, 3)}
   <e2, e1> = {(-1, 1, -3)}
   <u2, u2> = <u2, u6> = <u6, u2> = {(0, 0)}
   <u1, u2> = {(1)}
   <u2, u1> = {(-1)}
   <u2, u3> = {(0, 1)}
   <u3, u2> = {(0, -1)}
   <u2, u4> = <u5, u2> = {(1, 0)}
   <u4, u2> = <u2, u5> = {(-1, 0)}

(b) If f appears in a DO SIM set expression, then
$T_f$ could be a multi-valued mapping. To handle
this case, replace any statement in this paper of
the form "... $T_f(P)$ ..." by "there exists an
$X \in T_f(P)$ such that ... X ...".

(c) We let $\phi$ denote the empty set.


Precedence Relations 


The FORTRAN representation of a program 
usually specifies more precedence relations among 
the occurrences than are necessary. For example, 
b2 need not precede cl in Program 1. We now 
describe all the precedence relations that are 
necessary in order to specify the correct execution 
of a program. These are of two types. The first, 
denoted by => , describes those precedence re- 
lations which are logically necessary for a mean- 
ingful execution of the program. 


Definition 4: For occurrences f and g in a
FORTRAN program, we write $f \Rightarrow g$ in any of the
following cases:

   1. (a) g is a generation and f appears
      on the right-hand side of the assignment
      statement of g.
      (b) f appears in a subscript expression
      of g.
   2. (a) f appears in the conditional
      expression of a conditional branch, and g
      appears in a statement whose execution is
      conditional upon which branch is taken.
      (b) f appears in the limits of a DO
      statement, or in the index set expression
      of a DO SIM statement, whose range contains g.

In 2(a), we consider a conditional assignment
statement to consist of a conditional branch
and an assignment statement.

The relations $\Rightarrow$ for Program 1 are indicated
in Figure 4. E.g., the $\Rightarrow$ in the a2 row,
a1 column denotes the relation $a2 \Rightarrow a1$.


The second form of precedence relation,
denoted by $\to$, is necessitated by data conflicts.
If a generation and any other occurrence reference
the same array element, then the order of the
references must be specified. Our previous
remarks then lead to the following definition.

Definition 5: For each relevant pair of occurrences
f, g with $f \neq g$, we let $f \to g$ if and only if f
precedes g and there exists an element
$X \in \langle f, g \rangle$ with $X \sim 0$.

The relations $\to$ for Program 1 are shown in
Figure 4.

We let $\Rrightarrow$ denote the union of the relations
$\to$ and $\Rightarrow$, so $f \Rrightarrow g$ if $f \to g$ or $f \Rightarrow g$.
Then $\Rrightarrow$ gives all precedence relations necessary
for the proper execution of the program. It
can be used to determine, for example, that during
an iteration of the I loop of Program 1, the J and
K loops can be executed concurrently by two
independent processors. This yields a generalization
of the methods of [6]. However, this type of
parallelism will not be discussed here.


The Complete Representation 
We define a program specification $\mathcal{S}$ to
consist of the following:

S1. A program tree, also denoted by $\mathcal{S}$.

S2. The precedence relations $\to$ and $\Rightarrow$.

S3. A specification of the occurrence
mapping for each occurrence.

S4. A specification of the index set of
each occurrence and of the assignment
values for each generation, in terms of
occurrences.

Part S4 is quite vague. For Program 1, it
might include the following:

   $\mathcal{J}_q = \{(i) : 2 \le i \le p \text{ and } b1 \ge 0\}$
   a1 = a2 + u1**2.

S3 is also vague if we consider occurrences like
A(B(I), J). We will not need to define S3 and S4
any more precisely because our translation procedure
will leave these parts of the specification
essentially unchanged.

There are many criteria which must be met
for S1-S4 to be a valid program specification.
However, if S3 and S4 are assumed to be valid,
then the following conditions are sufficient to insure
that the entire specification is valid.

L1. (a) For each relevant pair of occurrences
f, g with $f \neq g$: if there exists an
$X \in \langle\langle f, g \rangle\rangle$ with $X \sim 0$, then either
$f \to g$ or $g \to f$.

(b) For each generation g: if
$X \in \langle\langle g, g \rangle\rangle$ and $X \sim 0$, then $X = 0$.

L2. The relation $\Rrightarrow$ is tree consistent.

Note that to verify L1, it suffices to verify
it with each set $\langle\langle f, g \rangle\rangle$ replaced by
$\langle f, g \rangle$.

Given a valid program specification, we
can use it to write an extended FORTRAN program.
For example, we define a program specification
as follows:

S1. The program tree is given by Figure 2.

S2. We let $\bar f \to$ or $\Rightarrow \bar g$ if the relation
$f \to$ or $\Rightarrow g$ appears in Figure 4. We also add the
following relations: $\bar{u1} \to \bar{u2}$, $\bar{u2} \to \bar{u4}$,
$\bar{u5} \to \bar{u2}$, and $\bar{b5} \to \bar{b2}$.

S3, S4. We obtain this from the specification
of Program 1 in the obvious way. E.g., we
have

   $\bar{a1} = \bar{a2} + \bar{u1}$**2
   $T_{\bar{u3}}(k, i) = T_{u3}(i, k) = (i, k-1)$.

The reader can check that this specification satisfies
L1 and L2. A simple-minded translation of
this specification into an extended FORTRAN program
gives Program 2. (A more efficient translation
is possible.)

As this example shows, the problem of go- 
ing from a valid program specification to a 
FORTRAN program can be difficult. However, it is 
always possible. A compiler would probably not 
do this, but would translate the program specifi- 
cation into an internal form suitable for generating 
code. 


Program Mappings 
Linear Program Mappings 


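Our basic idea is to transform a given program
specification $\mathcal{S}$ into a new one $\bar{\mathcal{S}}$ which
produces the same results, but has more parallel
computation. The tree of $\bar{\mathcal{S}}$ will be obtained by
splitting apart and rearranging loop nodes of $\mathcal{S}$.
In the following definition, $\theta^{-1}(\alpha)$ is the set of
nodes into which the loop node $\alpha \in \mathcal{N}(\mathcal{S})$ is split.

Definition 6: Let $\mathcal{S}$ and $\bar{\mathcal{S}}$ be program trees. A
linear tree mapping $\Omega : \mathcal{S} \to \bar{\mathcal{S}}$ consists of:

(1) A surjective mapping $\theta : \mathcal{N}(\bar{\mathcal{S}}) \to \mathcal{N}(\mathcal{S})$
such that:
   (a) $\theta$ is a 1-1 correspondence between
   $\mathcal{O}(\bar{\mathcal{S}})$ and $\mathcal{O}(\mathcal{S})$.
   (b) If $\alpha$ is an ancestor of an occurrence
   node f of $\bar{\mathcal{S}}$, then $\theta(\alpha)$ is an
   ancestor of $\theta(f)$.
   (c) For each $f \in \mathcal{O}(\bar{\mathcal{S}})$: $|\theta(f)| = |f|$.
We denote $\theta^{-1}(f)$ by $\bar f$ for each $f \in \mathcal{O}(\mathcal{S})$.

(2) For each $f \in \mathcal{O}(\mathcal{S})$, a linear 1-1
correspondence $\Omega_f : Z_f \to Z_{\bar f}$ satisfying the following
condition: For any $f, g \in \mathcal{O}(\mathcal{S})$, the mapping
$\Omega_{\langle f,g \rangle} : Z_{f \cap g} \to Z_{\bar f \cap \bar g}$ defined by

   $\Omega_{\langle f,g \rangle} = \pi^{\bar f}_{\bar f \cap \bar g} \circ \Omega_f \circ (\pi^f_{f \cap g})^{-1}$

is single-valued, and $\Omega_{\langle f,g \rangle} = \Omega_{\langle g,f \rangle}$.

As an example, let $\mathcal{S}$, $\bar{\mathcal{S}}$ be the trees of
Figures 1 and 2, respectively. We define the
linear tree mapping $\Omega : \mathcal{S} \to \bar{\mathcal{S}}$ as follows. Let
$\theta(I_1) = \cdots = \theta(I_5) = I$, $\theta(\bar J) = J$, $\theta(\bar{a1}) = a1$, etc.
Let $\Omega_f$ be the identity mapping unless f is a
descendant of K, in which case let
$\Omega_f(i, k) = (k, i)$. We then have

   $\Omega_{\langle u4, u6 \rangle}(i, k) = (k, i)$
   $\Omega_{\langle u4, c1 \rangle}(i, k) = (k)$
   $\Omega_{\langle a1, c1 \rangle}(i) = 0$.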


Definition 7: Let $\mathcal{S}$, $\bar{\mathcal{S}}$ be program specifications.
A linear program mapping $\Omega : \mathcal{S} \to \bar{\mathcal{S}}$ consists of a
linear tree mapping $\Omega$ from the tree of $\mathcal{S}$ to that
of $\bar{\mathcal{S}}$ such that:

(1) For each $f \in \mathcal{O}(\mathcal{S})$, f and $\bar f$ are
occurrences of the same variable, and
$T_f = T_{\bar f} \circ \Omega_f$.

(2) $f \Rightarrow g$ in $\mathcal{S}$ if and only if $\bar f \Rightarrow \bar g$ in
$\bar{\mathcal{S}}$.

(3) Replacing each occurrence f by $\bar f$
in S4 of the specification $\mathcal{S}$ gives S4 of $\bar{\mathcal{S}}$.

Part 3 of the definition is as vague as our
definition of S4 of the program specification.
However, its meaning should be clear from our example.
The mapping $\Omega$ defined above gives a
linear program mapping from the specification of
Program 1 to that of Program 2.


We say that two program specifications
are equivalent if they produce the same output
when run with the same legal input values. (Recall
restriction 2.) Let $\Omega : \mathcal{S} \to \bar{\mathcal{S}}$ be a linear
program mapping. To obtain the equivalence of
$\mathcal{S}$ and $\bar{\mathcal{S}}$, we will assume that $\Omega$ satisfies the
following condition.

EL. For each relevant pair of occurrences
f, g in $\mathcal{S}$: if there exists an element $X \in \langle f, g \rangle$
such that either (i) $X > 0$ or (ii) $X \sim 0$ and
$f \to g$, then either (i) $\Omega_{\langle f,g \rangle}(X) > 0$ or (ii)
$\Omega_{\langle f,g \rangle}(X) \sim 0$ and $\bar f \to \bar g$.

Theorem 8: Let $\mathcal{S}$ be a valid program specification,
let $\bar{\mathcal{S}}$ be a specification satisfying L2, and
let $\Omega : \mathcal{S} \to \bar{\mathcal{S}}$ be a linear program mapping
satisfying EL. Then

(1) $\bar{\mathcal{S}}$ is a valid program specification.

(2) $\mathcal{S}$ and $\bar{\mathcal{S}}$ are equivalent.

(3) For each relevant occurrence pair f,
g of $\mathcal{S}$: $\langle\langle \bar f, \bar g \rangle\rangle = \Omega_{\langle f,g \rangle}(\langle\langle f, g \rangle\rangle)$.

Part 3 of the theorem allows us to choose
$\langle \bar f, \bar g \rangle$ to be $\Omega_{\langle f,g \rangle}(\langle f, g \rangle)$. The reader
can check that Theorem 8 implies the equivalence
of Programs 1 and 2.


Two Applications


We now describe two simple ways of obtaining
a new program specification $\bar{\mathcal{S}}$ from a
given specification $\mathcal{S}$. We leave it to the reader
to verify that Theorem 8 implies the equivalence of
$\mathcal{S}$ and $\bar{\mathcal{S}}$.

1. Interchange a tightly nested DO/DO
SIM pair of nodes. E.g., let $\mathcal{S}$ be the specification
of the L loop of Program 1. Then $\bar{\mathcal{S}}$ is the
specification of the program:

      DO 30 M = 4, 50
      DO 30 SIM FOR ALL L ∈ {2, ..., 100}
   30 E(I, L, M) = E(I-1, L+1, M-3) * B(I+1).

2. Split a single DO SIM node into several,
one for each son. Applying this to the above
specification $\bar{\mathcal{S}}$ then gives the specification $\bar{\bar{\mathcal{S}}}$
of the following program:

      DO 33 M = 4, 50
      DO 31 SIM FOR ALL L1 ∈ {2, ..., 100}
   31 TMP1(L1) = E(I-1, L1+1, M-3)
      DO 32 SIM FOR ALL L2 ∈ {2, ..., 100}
   32 TMP2(L2) = B(I+1)
      DO 33 SIM FOR ALL L3 ∈ {2, ..., 100}
   33 E(I, L3, M) = TMP1(L3) * TMP2(L3).


Note that this new version describes one 
way that the L loop of Program 1 might actually 
be executed by an array computer, TMP1 and TMP2 
representing arithmetic registers. 


In general, repeated application of these 
two rewriting procedures shows that DO SIM loops 
can always be rewritten in terms of vector assign- 
ment statements. 
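The two rewritten programs can be simulated in Python to see that they compute the same values. This is a sketch under stated assumptions: arrays are modelled as dictionaries, the initial contents of E and B are invented test data, and the function names are hypothetical. Each DO SIM statement is modelled by computing all right-hand sides before any assignment is made.

```python
L_SET = range(2, 101)   # the set {2, ..., 100}
M_RANGE = range(4, 51)  # DO ... M = 4, 50


def initial_arrays(i):
    """Invented test data: row I-1 of E and the element B(I+1)."""
    e = {(i - 1, l, m): 1000 * l + m
         for l in range(1, 102) for m in range(1, 51)}
    b = {i + 1: 3}
    return e, b


def interchanged(e, b, i):
    """DO M outside, DO SIM L inside (rewriting 1)."""
    e = dict(e)
    for m in M_RANGE:
        rhs = {l: e[(i - 1, l + 1, m - 3)] * b[i + 1] for l in L_SET}
        for l, value in rhs.items():
            e[(i, l, m)] = value
    return e


def split(e, b, i):
    """One DO SIM loop per occurrence (rewriting 2)."""
    e = dict(e)
    for m in M_RANGE:
        tmp1 = {l: e[(i - 1, l + 1, m - 3)] for l in L_SET}  # statement 31
        tmp2 = {l: b[i + 1] for l in L_SET}                  # statement 32
        for l in L_SET:                                      # statement 33
            e[(i, l, m)] = tmp1[l] * tmp2[l]
    return e


e0, b0 = initial_arrays(5)
print(interchanged(e0, b0, 5) == split(e0, b0, 5))  # True
```

TMP1 and TMP2 play the role of vector registers here, matching the remark in the text that the split form describes how an array computer might actually execute the loop.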


Coordinate Mappings 


The mapping $\Omega_f$ for a linear program mapping
$\Omega$ may be any linear 1-1 correspondence.
This allows a generalization of the hyperplane
method of [4], which will be done in a later paper.
For the coordinate method, we restrict $\Omega_f$ to be a
permutation of the coordinates.

To form the tree of $\bar{\mathcal{S}}$, we allow a DO
node to be changed into one or more DO and/or
DO SIM nodes, which may be moved lower in the
tree. DO SIM nodes may not be changed, and no
other rearrangement of nodes is allowed.


Definition 9: A linear program mapping
$\Omega : \mathcal{S} \to \bar{\mathcal{S}}$ is a coordinate mapping if there is a
subset C of the DO nodes of $\mathcal{S}$, called the set of
changed nodes, satisfying the following conditions
(where $\theta$ is as in Definition 6):


(1) For each fe O(S) , let I, weeks I", f 
be a branch of § , let iy = 6(1?), and let TT be 


the permutation such that pm) sia y(n) 


branch of &. Then 
(a) If j<k and m(j) > mk), then 


yi) eC and yk) ye 
Oy aceite Gee wind 


,tisa 


(2) For each loop node $\alpha$ of $\mathcal{S}$ with
$\alpha \notin C$: $\theta^{-1}(\alpha)$ consists of a single node of the
same type (DO or DO SIM) as $\alpha$.

The mapping $\Omega$ defined above from the
specification of Program 1 to that of Program 2 is
a coordinate mapping with C = {I}. Thus, only
the I node of Program 1 is changed by $\Omega$.


For a coordinate mapping $\Omega : \mathcal{S} \to \bar{\mathcal{S}}$, we
introduce the following condition.

EC. For each relevant pair of occurrences
f, g in $\mathcal{S}$:

(1) If $f \to g$, then $\bar f \to \bar g$.

(2) For each $X \in \langle f, g \rangle$ with $X > 0$,
either
   (a) $\Omega_{\langle f,g \rangle}(X) > 0$, or
   (b) $\Omega_{\langle f,g \rangle}(X) \sim 0$ and $\bar f \to \bar g$.

The reader can verify that if a coordinate
mapping satisfies EC, then it satisfies EL.
Theorem 8 then gives the following result.


Theorem 10: Let $\mathcal{S}$ be a valid program specification
and $\Omega : \mathcal{S} \to \bar{\mathcal{S}}$ a coordinate mapping satisfying
EC. If $\bar{\mathcal{S}}$ satisfies L2, then it is a valid program
specification and is equivalent to $\mathcal{S}$. For any
relevant occurrence pair f, g of $\mathcal{S}$, we can let
$\langle \bar f, \bar g \rangle$ equal $\Omega_{\langle f,g \rangle}(\langle f, g \rangle)$.

The following result shows that any coordinate
mapping can be obtained from a sequence of
coordinate mappings, each of which changes just
one node.

Theorem 11: Let $\mathcal{S}$, $\bar{\mathcal{S}}$ be valid program
specifications and $\Omega : \mathcal{S} \to \bar{\mathcal{S}}$ a coordinate mapping
satisfying EC. Let $\alpha$ be any DO node of $\mathcal{S}$ which
is changed by $\Omega$ such that no descendant of $\alpha$ is
changed by $\Omega$. Then there exists a valid program
specification $\mathcal{S}'$ and coordinate mappings
$\Omega' : \mathcal{S} \to \mathcal{S}'$ and $\Omega'' : \mathcal{S}' \to \bar{\mathcal{S}}$ satisfying EC such
that $\Omega'$ changes only $\alpha$.


The Coordinate Algorithm 


We now describe the coordinate algorithm.
Given a program specification $\mathcal{S}$ and a DO node
I of $\mathcal{S}$, the coordinate algorithm generates a program
specification $\bar{\mathcal{S}}$ and a coordinate mapping
$\Omega : \mathcal{S} \to \bar{\mathcal{S}}$ such that (i) $\Omega$ changes only I and
satisfies EC, and (ii) $\bar{\mathcal{S}}$ satisfies L2. Theorem
10 implies that $\bar{\mathcal{S}}$ is equivalent to $\mathcal{S}$. The algorithm
can find any possible coordinate mapping
satisfying (i) and (ii). By Theorem 11, we want
to apply the algorithm repeatedly, starting from
the innermost loop nodes.

We now describe, explain and illustrate
the five major steps of the algorithm.

1. Let $\ll$, $\approx$ be the relations $<$, $\sim$
on $\mathcal{S}$ which would result if the I node were
changed to a DO SIM node. Define the relation
---> on $\mathcal{O}(I)$ and the subset $\mathcal{B}$ of $\mathcal{N}(I)$ as
follows. For each relevant pair of occurrences f,
g in $\mathcal{O}(I)$, and each $X \in \langle f, g \rangle$ with
$X > 0$:

   (a) If $X \approx 0$, then include the relation
   f ---> g.

   (b) If $X \ll 0$ then: for each descendant
   J of I which either equals or
   is an ancestor of $f \cap g$, if
   $\pi^{f \cap g}_J(X) \ll 0$ then let J be an
   element of $\mathcal{B}$.

The relations ---> are the additional relations
$\to$ required if we were to simply change
the I node to a DO SIM node. The existence of
an X satisfying (b) immediately precludes this
possibility.

The nodes in $\mathcal{B}$ are "blocking nodes".
This means that for each $J \in \mathcal{B}$, J cannot appear
inside a DO SIM I node, and none of the nodes
into which I is changed can be moved below J.

Applying step 1 to the I node of Program
1 gives the relations ---> shown in Figure 4.
It finds $\mathcal{B} = \{J\}$. E.g., for the occurrence pair
u2, u4 we have $(1, 0) \in \langle u2, u4 \rangle$, $(1, 0) > 0$
and $(1, 0) \approx 0$. Hence, (a) gives
u2 ---> u4.

For the occurrence pair a1, a1: for any
$i > 0$ we have $(i, 0) \in \langle a1, a1 \rangle$, $(i, 0) > 0$ and
$(i, 0) \approx 0$. Hence (a) gives a1 ---> a1. For
any $j < 0$, we have $(i, j) \in \langle a1, a1 \rangle$,
$(i, j) > 0$ and $(i, j) \ll 0$. Since $\pi^{a1 \cap a1}_J(i, j)$
$= (i, j)$, part (b) places J in $\mathcal{B}$.

Note that if L were a DO node, then step
1 applied to e1, e2 would place L in $\mathcal{B}$. This
shows why the algorithm should be applied to
inner loops first.
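Step 1 can be sketched in Python for a finite sample of conflict vectors. Everything in the calling convention is an assumption made for illustration: each pair is given with the branch ending at the common ancestor of f and g (starting at I), the DO flags of that branch, and a finite list of sample elements of the compile-time conflict set (the real set for a1, a1 is all of $Z^2$, so only a few representatives are listed). The names step1, sign and do_part are hypothetical.

```python
def sign(t):
    """Lexicographic sign of a tuple: -1, 0 or +1 by its first non-zero entry."""
    for v in t:
        if v != 0:
            return 1 if v > 0 else -1
    return 0


def do_part(x, is_do, skip=None):
    """Components of x at DO positions, ignoring position `skip` if given."""
    return tuple(v for pos, (v, flag) in enumerate(zip(x, is_do))
                 if flag and pos != skip)


def step1(pairs, changed):
    """Return the dashed-arrow relations and the blocking nodes found by step 1."""
    arrows, blocking = set(), set()
    for f, g, branch, is_do, samples in pairs:
        pos = branch.index(changed)
        for x in samples:
            if sign(do_part(x, is_do)) <= 0:
                continue                      # only X > 0 is considered
            # Sign of X when the changed node is treated as a DO SIM node.
            reduced = sign(do_part(x, is_do, skip=pos))
            if reduced == 0:                  # case (a)
                arrows.add((f, g))
            elif reduced < 0:                 # case (b)
                for depth in range(pos + 1, len(branch)):
                    prefix = do_part(x[:depth + 1], is_do[:depth + 1], skip=pos)
                    if sign(prefix) < 0:
                        blocking.add(branch[depth])
    return arrows, blocking


PAIRS = [
    ("u2", "u4", ["I", "K"], [True, True], [(1, 0)]),
    ("a1", "a1", ["I", "J"], [True, True], [(1, 0), (1, -1), (0, 0), (-1, 2)]),
    ("e1", "e2", ["I", "L", "M"], [True, False, True], [(1, -1, 3)]),
]
arrows, blocking = step1(PAIRS, "I")
print(sorted(arrows), sorted(blocking))
```

With L marked as a DO SIM node the pair e1, e2 contributes nothing; changing its flag to True makes L a blocking node, which illustrates the remark that inner loops should be processed first.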


2. Let ==> denote the relation on $\mathcal{O}(I)$
formed by the union of the relations $\Rightarrow$, $\to$ and
--->, and let ==>* denote its tree completion.
Complete the set $\mathcal{B}$ as follows: for each node
$\alpha$ of $\mathcal{N}(I)$, if $\alpha$ ==>* $\alpha$ then add $\alpha$ to $\mathcal{B}$.

For Program 1, every relation $\alpha$ ==>* $\beta$ on
$\mathcal{N}(I)$ for which $\alpha$ ==> $\beta$ does not hold is indicated
by an "*" in Figure 4. Step 2 then adds the following
nodes to $\mathcal{B}$: a1, a2, c1, c2, d1, d2.


3. Choose a tree partition P of 𝒩(I)
such that:
(a) Any non-terminal node of P consists
of a single loop node of 𝒩(I)
which is not in ℬ.
(b) The relation induced on 𝒪(P) by
==> is tree consistent.

To obtain the maximum amount of parallelism,
the partition P should be chosen to satisfy
the following conditions as well:

(1) For any α ∈ 𝒩(I): if α ∉ ℬ and
𝒩(α) ∩ ℬ ≠ ∅, then {α} is one of the sets
of P.

(2) For any α, β ∈ 𝒩(I): if 𝒩(α) ∩ ℬ = ∅
and β ∈ ℬ, then α and β belong to
different sets of P.


There is an algorithm for choosing such a
P. Applying it to our example, and then combining
the resulting sets {d1, d2} and {b4}
into a single set, gives the tree partition P
shown in Figure 5. In general, finding the best
tree partition P is a major implementation problem.


4. Let N_1, ..., N_m be the terminal
nodes of P. Define the program tree of S̄ as follows:
(a) 𝒩(S̄) = {α : α ∈ 𝒩(S), α ≠ I} ∪
{I_1, ..., I_m}.
(b) For any nodes α, β of S̄ not
equal to any I_j: α is a descendant of β in S̄
if and only if α is a descendant of β in S.
(c) For any node α of S̄ not equal to
any I_j:
(i) α is a descendant of I_j if
and only if α ∈ N_j;
(ii) α is an ancestor of I_j if and
only if either α is an ancestor of
I or {α} is an ancestor of N_j.

This defines a tree in which the I node is
split into the nodes I_1, ..., I_m. For each j,
𝒩(I_j) = {α : α ∈ N_j} ∪ {I_j}.


In our example, S has the tree of Figure 
Zs 


5. Define the relation > on § as fol- 


lows. For any occurrence f, g,in $ _, we let 


f gq if either: 
(a) f > g in S,or 
(b) f--->g ~-fe NrgeN, and 
either (i) i#j, or (ii) i=j and 


L is a DO SIM node. 


In our example, step 5(b) gives the follow- 
ing relations: (i) ul * u2,b5 7% b2 and 
(ii) u2 * u4, uS *% u2. 


The mapping δ of Definition 6 is defined
by δ(α) = α if α ≠ I_j, and δ(I_j) = I. Parts S3
and S4 of the specification S̄ and the coordinate
mapping Ω: S̄ → S are then defined in the obvious
way.


The equivalence of S and S̄ is implied
by Theorem 10 and Theorem 12 below. Note that
Theorem 10 indicates how to compute the sets
<f, g> in order to apply the coordinate algorithm
again to S̄. Theorems 11 and 13 show that the
coordinate algorithm can be used to obtain any
desired coordinate mapping.


[Figure 5: the tree partition P for the example, with the sets
{J, a1, a2, u1}, {u2, u3, b2, b3, u4, u5, u6},
{b1, q, L, M, e1, e2, b5}, {K}, {d1, d2, b4} and {c1, c2}.]


Theorem 12: Let S be a valid program specification,
I a DO node of S, and let S̄ and Ω: S̄ → S
be constructed by the coordinate algorithm.
Then S̄ satisfies L2, and Ω is a coordinate mapping
which satisfies EC.

Theorem 13: Let S, S̄ be valid program specifications,
and let Ω: S̄ → S be a coordinate mapping
satisfying EC which changes only the node I.
Then S̄ and Ω can be constructed from S by
the coordinate algorithm.


Concluding Remarks 


General DO Increments 


To handle arbitrary constant DO increments 
we need only change the definition of the set 


<<f, g>>. Let i, ‘bee * be the branch with 


x =fNg. Assume that for each j , the p loop 
is a DO loop of the following form: 


j-l 
p> od * it ; uw , @ 
r=] fF 


po P= + 


where the c and a are integer constants, and 
# is any expression not involving the r - More 
general DO loops must be put into this form by
changing the index variable. For purposes of the 


definition, replace a DO SIM v loop by any DO 
loop whose index set contains 3 j° 
I 


Now, define << f, g >> to be the set of 
all (xt, aise x*) € Leng such that there exist 
1 k, _ 9 
P rg: and Qe Sg with (y',...,y) Meng (2) 
- Teng (P) and 
Js 


. , j-l 
d? x x 
r=] 


y o yt 
r 
for each j . 
With this new definition, all of our results 
remain valid in the general case of arbitrary con- 
stant DO increments. 


Further Refinements 


Several refinements of the coordinate 
method to yield more parallelism are possible. For 
example, it is clear that the computation of 
U(I+1, J) ** 2 in statement 10 of Program 1 could 
be done inside a DO SIM I loop. This involves 
first splitting the J loop into two loops. In general,
any node in the set ℬ of the coordinate algorithm
is a candidate for splitting. Such refinements
will be described in [5].


Practical Problems 
There are many practical problems to be 


solved in implementing the coordinate method for 
a real compiler. We list some of these below. 




Although described separately, they are all closely 
related. The solutions of these problems will de- 
pend upon the particular parallel computer design. 


Choice of the DO Node I. To maximize 
parallelism, by Theorem 11 we would apply the 
coordinate algorithm successively to each DO 
node, working up from the bottom of the tree. However,
this may produce more parallelism than can
be exploited by the computer. Some procedure is 
needed to choose the nodes to which the coordin- 
ate algorithm should be applied. 


Choice of the Tree Partition P. Maximizing
the parallelism does not necessarily produce
the best program. In our example, we assumed an 
algorithm clever enough not to put b4 into its own 
separate DO SIM I loop. However, we might have 
done better to further decrease the parallelism by 


putting u6 in the I, loop, eliminating the need for 
TMPS. 


Translation of the Specification. It is
necessary to translate the specification into either
FORTRAN or some intermediate language from
which the compiler can generate code. Our example
indicates that conditional branches can always
be handled by converting to conditional
assignment statements. However, this will not
always be the best procedure. Other improvements
are also needed. E.g., in Program 2 we
can replace TMP1 and TMP3 by new occurrences of
P and Q.
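
The conversion of a conditional branch into a conditional assignment can be illustrated with a small sketch. The arrays and the condition below are hypothetical and are not taken from Program 1 or Program 2; the point is only that in the second form every index executes the same assignment, so the loop body no longer contains a branch around a statement.

```python
def with_branch(a, b):
    """Sequential form: a conditional branch skips the assignment."""
    out = list(a)
    for i in range(len(a)):
        if b[i] > 0:                 # conditional branch
            out[i] = a[i] / b[i]
    return out


def with_conditional_assignment(a, b):
    """Branch-free form: a mask selects, per index, the new or the old value."""
    mask = [x > 0 for x in b]        # the condition, evaluated for all indices
    # Every index performs one assignment; the mask picks the value stored.
    return [a[i] / b[i] if mask[i] else a[i] for i in range(len(a))]


a = [4.0, 3.0, 8.0]
b = [2.0, 0.0, -1.0]
print(with_branch(a, b) == with_conditional_assignment(a, b))
```

Both forms compute the same values here; whether the second is actually cheaper depends on the target machine, which is the caveat the text raises.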


Our example shows how complicated these 
problems can become. However, most real pro- 
grams are simpler, and simple solutions will usu- 
ally be good enough. For example, we might al- 
ways choose P to consist of a single set. The 
coordinate algorithm would then simply try to re- 
write the program with a single DO SIM I loop. 


Conclusion 


We have presented a method of detecting 
parallelism in sequential programs which gener- 
alizes several previous methods. It forms the 
basis of a sequential to parallel conversion phase 
of a compiler for a parallel array or vector computer.
The techniques employed - particularly the
use of the <f, g> sets and the relation = - 
should be applicable to other areas of program op- 
timization. 


MODELING FOR PARALLEL COMPUTATION: A CASE STUDY


Abstract


A methodology for building models of parallel
computation for a special class of algorithms,
namely compilation, is presented. The control
graph component of the model is of the extended
Petri Net form with switches and token absorbers.
A detailed example of the use of the graph is
given and some formal properties, such as conservation
of tokens and proper termination, are proven.


I. Introduction.


In recent years, we have seen the emergence
of numerous graph models for parallel computation [3].
Depending on the investigators' backgrounds
(engineers, logicians, mathematicians),
the objectives of the models have been varied 
(e.g. coherent design of modular parallel sys- 
tems, correct flow of control in the execution 
of parallel algorithms, relations between sequen- 
tial programs and their parallel representations, 
prediction of cost and performance of multipro- 
cessors). 


In this paper, we present (first in Section 
II) the criteria which have led us to select 
some particular node and arc primitives and 
graph properties for the modeling of a specific 
class of algorithms, namely parallel compilation. 
The choice of this test vehicle for our modeling
methodology is motivated by the following observations.
First, techniques for detecting parallelism
automatically are best suited to high-level
languages and scientific applications and do not
carry over well to compilation, which has most
often been considered as
a sequential process. Therefore some "human
insight" appears necessary. Second, this will 
oblige us to try and uncover some parallelism in 
the compilation process through algorithm modi- 
fication, changes in data structures, redundancy, 
etc. Finally, assuming the efficiency of multi- 
processors at run-time, means must be found to 
use them efficiently at compile time. 


In Section III, we apply this modeling meth- 
odology to an example taken from the compilation 
process. Different stages in the modeling are 
successively introduced. They show the impor- 
tance and the need for the features introduced in 
Section II. 


Section IV defines formally the graph model 
and some of its properties. It is shown how the 
latter can be derived through techniques resem- 
bling those used in the theory of formal lang- 
uages. 
II. Graph Primitives and Model Properties.


In the rest of this paper, we assume the 
reader's familiarity with the basic concepts of 
graph theory. 


1. Places and Transitions of the Control Graph. 


Like any other algorithmic process, compila- 
tion has three components: control, computation 
and data. We shall separate the modeling into a 
control graph (control of operators) and a data- 
flow graph (action of operators on data). In 
this paper, we investigate the control part. 


The amount of interpretation that is pre- 
sent in a model depends mostly on the goals that 
one wants to achieve in the modeling process. 

If the primary objective is to describe specific 
algorithms or systems, then a total interpretation
will be most convenient. Adams' Computation
Graph [1] is such an example, and it can
be regarded as a parallel programming language. 
On the other hand, if the derivation of general 
formal properties and the characterization of 
parallel algorithms are the main goals, then un- 
interpretation is necessary and schemata have to 
be introduced [6]. In our case, we are dealing 
with compilation considered as a class of al- 
gorithms and not with the modeling of a partic- 
ular compiler. Hence, we will not choose total 
interpretation. At the same time, we wish to 

be able to retain some descriptive power and we 
have to rule out complete uninterpretation. 
Therefore our model is partially interpreted. 
Most of the interpretation takes place in the 
data graph, but some nodes/arcs of the control 
graph possess specific meanings. 


One can view compilation as a general pipe- 
line process, namely: 


lexical analysis → syntax analysis →
code generation.


The unit of information flowing through the 
pipe can vary widely in size. For example, one 
could choose a subprogram, a block, a statement, 
a lexical or syntactical entity. Furthermore, 
each element of the pipe can be broken into a 
number of substages with appropriate latches. 

As we shall see in the next section, this pipe- 
line concept can also exist at very fine levels 
of detail. Independently of the size of the 
unit of information, a "token machine" is appropriate
to represent pipe-line flow. Therefore
the control graph is based upon the Petri Net
concept [5,9]. The formal definition of the
graph being given in Section IV, we recall here 
only that a Petri Net is composed of a set of
transitions (corresponding to events and denoted
by | in figures), a set of places (corresponding
to the holding of conditions and denoted by ○),
and a set of directed arcs linking (input) places
to transitions and transitions to (output) places.
Places are able to receive tokens which mark the
holding of conditions. A place without a token is
empty; otherwise it is full and the presence of
a token will be shown by a dot in the figures. An
event can occur (equivalently a transition can 
fire) if all its input places are full. After 
the firing, a token is removed from each input 
place and a token is added to each output place. 
We do not allow a place to be input and output to 
the same transition in order to "clarify" the 
description of holdings. Figure 1 shows a two- 
stage pipe-line process modeled by a Petri Net. 
When place 1 becomes full, stage 1 can be ini- 
tiated through the firing of transition a. When 
stage 1 is completed, transition a, will fire 
and the latch will become full. Now transition 

b can fire, allowing the processing of stage 2 


and the possibility for transition a _ to fire 
anew if place 1 becomes full again. Thus, stages 
1 and 2 can be active simultaneously. This particular
instance of a pipe-line is built in such
a way that stage 1 has to wait for the initiation
of the i-th computation of stage 2 before being
able to initiate its own (i+1)-th computation. In
Figure 2, it is shown how a buffer can be introduced
between the two stages (the buffer here
being of size 2).
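
The firing rule and the two-stage pipe-line can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the place names (`in1`, `free1`, `busy1`, `latch`, `free2`, `busy2`, `out`) and the transition names are hypothetical, and the net is one plausible reading of the behaviour described for Figure 1, not a transcription of the figure. All transitions use the default AND logic, and no place is both input and output to the same transition.

```python
from collections import Counter

# Hypothetical two-stage pipe-line: transition -> (input places, output places).
PIPELINE = {
    "start1": (("in1", "free1"), ("busy1",)),
    "end1": (("busy1",), ("latch",)),
    "start2": (("latch", "free2"), ("busy2", "free1")),
    "end2": (("busy2",), ("out", "free2")),
}


def enabled(marking):
    """Transitions whose input places are all full."""
    return [t for t, (ins, _) in PIPELINE.items()
            if all(marking[p] > 0 for p in ins)]


def fire(marking, t):
    """Remove one token from each input place, add one to each output place."""
    if t not in enabled(marking):
        raise ValueError(f"transition {t} cannot fire")
    ins, outs = PIPELINE[t]
    new = Counter(marking)
    new.subtract(ins)
    new.update(outs)
    return +new  # drop places that are now empty


# Two items are waiting; both stages are initially free.
marking = Counter({"in1": 2, "free1": 1, "free2": 1})
for t in ("start1", "end1", "start2"):
    marking = fire(marking, t)
print(enabled(marking))
```

After stage 2 has been initiated, both the start of stage 1 (for the next item) and the completion of stage 2 are enabled, which is the overlap the text describes.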


Figure 1. Modeling a Pipe-line Process with 
Petri Nets. 


Figure 2. Increasing the size of Buffers between 
Stages. 
Petri Nets in the above form are not well
suited to representing predicates. Extensions
using conjunctive logic as above and disjunctive
logic, i.e. EOR [2], have been used to enhance
the descriptive power of models [7] while, at the
same time, incurring no loss in some formal properties
[4]. We follow the same approach here,
denoting by + the presence of an EOR condition 
at either the input or output of a transition 
(cf. Figure 3). The conjunctive logic, i.e. AND 
condition, is the assumed default situation. We 
forbid mixed logics since we can always realize 
the desired boolean condition with the inclusion 
of appropriate "dummy" places and transitions 
with simple logic. 




(a)Input Disjunctive 
Logic. Only one of the 
input places can be 
full. Then a can 
fire. 


(b)Output Disjunctive 
Logic. After firing of 
a, only one of the out- 
put places will receive 
a token. 


Figure 3. Disjunctive Logic. 


Although the EOR and AND logics have suffi- 
cient properties to show the flow of control in 
algorithms, we introduce nevertheless a new type 
of place that we call switches. Switches bear 
some analogy with the construct of the same name 
found in programming languages and also with 
Nutt's resolution procedures [8]. However, their 
actions are purposely more restricted than these 
procedures so that their presence will not destroy
formal properties of the model. Like any
other place, a switch can either be full or empty.
The presence or absence of tokens in a switch does
not influence the firing of the transition for
which it is an input place; that is to say, the
conditions for firing are tested on the set of
input places from which the switch has been removed.
A transition which has a switch as one
of its input places (and there cannot be more
than one switch per transition) is necessarily
of EOR-output logic with only two output places
corresponding respectively to a full switch
(branch f) and to an empty switch (branch e).
Figure 4 illustrates these concepts, with
switches drawn with their own symbol. As we shall see in the
next section, switches allow flexibility and
short cuts in the modeling of algorithms.
(a) Empty switch


(b) Full switch 


Figure 4. Illustration of the Firing of a 
Transition with an Input Switch. 
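
The behaviour of a transition with an input switch can be sketched as follows. The function and argument names are hypothetical. One point is an assumption rather than something stated in this section: the sketch removes one token from a full switch when the f branch is taken, which is how the later discussion of tokens left on switch 9 reads.

```python
from collections import Counter


def fire_with_switch(marking, inputs, switch, out_full, out_empty):
    """Fire a transition whose input places are `inputs` plus one switch.

    Enabling is tested on `inputs` only; the switch merely selects the
    output place. Returns the new marking, or None if not enabled.
    """
    if any(marking[p] == 0 for p in inputs):
        return None
    new = Counter(marking)
    new.subtract(inputs)
    if new[switch] > 0:
        new[switch] -= 1      # assumption: the f branch uses up one switch token
        new[out_full] += 1    # branch f (full switch)
    else:
        new[out_empty] += 1   # branch e (empty switch)
    return +new


print(fire_with_switch(Counter({"p": 1, "sw": 1}), ["p"], "sw", "pf", "pe"))
print(fire_with_switch(Counter({"p": 1}), ["p"], "sw", "pf", "pe"))
```

A full switch without a token on the ordinary input place does not enable the transition, so the first check returns `None` in that case.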


The transformation of a sequential program 
into parallel form might introduce some justifi- 
able redundancy. However, processing which has 
become useless should not be allowed to be car- 
ried on and to tie up resources that the rest of 
the system might need. This point is illustrated 
by the following example. We want to search a
linear table for a given key, and two processors
are available. Hence, we want processor 1
to start at the low end of the table with indices
being incremented and processor 2 at the other
end with decrementing indices. As soon as one of
the processors has found a matching entry, both
computations should be terminated. Moreover, if
processor 1 was started first and had found the
match before processor 2 was initiated, it
is evident that processor 2 should be prevented
from performing a useless task, and vice-versa.
In our modeling process, we use arcs which are
token absorbers to represent this situation.
Token absorbers also permit token conservation,
a property needed, as we see below, for modeling
pipe-line processes.


A token absorber is a multiarc with one
head (a transition) and one or more tails
(places). When the transition at the head
fires, tokens are removed from
each of the full tail places. Figure 5 shows
how this cancelling occurs for the previous example.
Transitions a and b correspond to the
comparison process; places C and F are the
conditions of no-matching; E and H also correspond
to no-match but in addition they indicate
that the ends of the (half) tables have been
reached, and D and G correspond to a match.
When either c or d fires, say c for example,
tokens which were possibly present on B, F,
G and H are removed. (In this example the presence
of a token on one of these places implies
the emptiness of the three others.) If we had a
match on both processors, because of duplicate
elements in the table, c and d could be terminating
at the same time. By convention, similar
to the realization of an interrupt scheme
without priority, two transitions cannot fire
simultaneously. If the two signal completion at
the same time, one of them will be chosen arbitrarily
as the first one to finish. Therefore
I and J cannot hold tokens simultaneously
(EOR-input logic at transition e), and K
corresponds to a match in the search process. On
the other hand, transition f will fire when
both processors report no success.


Figure 5. Illustration of the Use of Token Ab- 
sorbers. 


2. Execution Sequences and Properties of the Graph.


A control graph with places and transitions 
as above cannot describe a computational process 
per se. A meaning must be given to places and 
transitions. A first element of this semantic 
attachment is the data-flow graph associated with 
the control flow graph. A transition of the con- 
trol can be linked to an operator in the data- 
flow graph. Each operator takes its inputs from 
a range of memory locations, performs a function 
and outputs values in a domain of memory loca- 
tions. Furthermore, if the transition is of EOR- 
output logic, the operator indicates the output 
place on which a token is to be placed. An in- 
terpretation of the model consists of defining 
the data graph in terms of specific memory cells 
and their initial values, the operators' function,
ranges and domains, as well as an initial
marking of the control graph. The latter indi- 
cates which places are initially full, and the 
number of tokens on each place. In the sequel 
we will describe markings by the name of places 
which are full. The name of a place will occur 
as often as the number of tokens it holds. 
Starting with this initial marking, an execution 
sequence in the control graph is a sequence of 
transition firings. In the example of Figure 5, 
two out of the possible execution sequences, with 
an initial marking of a token on S, are: 


s a b b' b a' a c e
s a a' b b' a b a' b' b a f.


After each transition firing, the graph is in a
new state, or marking. The execution sequence
can also be given in terms of a sequence of markings.
For the two above, we have respectively:


S, AB, BC, CF, BC, CF, AF, DF, I, K
S, AB, BC, AB, AF, AB, BC, CF, AF, AB, AH, EH, L.


If the execution sequence is finite, the last
state reached is called a terminal marking.
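
The two execution sequences can be replayed mechanically. The sketch below is a minimal illustration under an explicit assumption: Figure 5 is not reproduced here, so the net structure in `NET` is reconstructed from the prose (s starts both processors; a and b are comparisons with EOR output; a' and b' step to the next entry; c and d report a match and absorb the other processor's tokens; e has EOR input; f needs both E and H). The helper names are hypothetical.

```python
from collections import Counter

# Assumed structure of the Figure 5 net, reconstructed from the text.
# transition: (input alternatives, output alternatives, absorbed places)
NET = {
    "s":  ([["S"]], [["A", "B"]], []),
    "a":  ([["A"]], [["C"], ["D"], ["E"]], []),
    "a'": ([["C"]], [["A"]], []),
    "b":  ([["B"]], [["F"], ["G"], ["H"]], []),
    "b'": ([["F"]], [["B"]], []),
    "c":  ([["D"]], [["I"]], ["B", "F", "G", "H"]),
    "d":  ([["G"]], [["J"]], ["A", "C", "D", "E"]),
    "e":  ([["I"], ["J"]], [["K"]], []),
    "f":  ([["E", "H"]], [["L"]], []),
}


def fire(marking, name, out=None):
    """Fire `name`; `out` names the output place chosen by an EOR-output operator."""
    ins, outs, absorbed = NET[name]
    new = Counter(marking)
    chosen = next((alt for alt in ins if all(new[p] > 0 for p in alt)), None)
    if chosen is None:
        raise ValueError(f"transition {name} cannot fire")
    produced = outs[0] if out is None else [out]
    if produced not in outs:
        raise ValueError(f"{out} is not an output place of {name}")
    new.subtract(chosen)
    for p in absorbed:          # token absorber: empty every full tail place
        new[p] = 0
    new.update(produced)
    return +new


def show(marking):
    """Name each full place once per token, as in the text."""
    return "".join(sorted(marking.elements()))


def replay(steps, start="S"):
    marking = Counter([start])
    trace = [show(marking)]
    for step in steps:
        name, out = step if isinstance(step, tuple) else (step, None)
        marking = fire(marking, name, out)
        trace.append(show(marking))
    return trace


first = ["s", ("a", "C"), ("b", "F"), "b'", ("b", "F"), "a'", ("a", "D"), "c", "e"]
second = ["s", ("a", "C"), "a'", ("b", "F"), "b'", ("a", "C"), ("b", "F"),
          "a'", "b'", ("b", "H"), ("a", "E"), "f"]
print(replay(first)[-1], replay(second)[-1])
```

Under this reconstruction the first sequence ends with a token on K (match) and the second with a token on L (no success), and each firing in the two sequences is enabled when it occurs.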


The control graph should be constructed in
such a way that, given an initial marking M_0 and a
set ℳ of goals, i.e. terminal markings, all
execution sequences starting with M_0 are
finite and reach one of the members of ℳ, after
which no other transition is able to fire. This is
akin to Gostelow's proper termination [4] and
strongly resembles the acceptance of strings by a
finite state automaton. For the example of Figure 5,
an initial marking M_0 of a token on S
and a set of goals ℳ = {K, L} yields proper
termination for the graph, if one forbids infinite
looping through transitions a,a' and b,b'. The
rationale for this restriction will be explained
in the following section. It is to be noticed
that M_0 and ℳ are at the discretion of the
model builder, but proper termination is independent
of the data graph and of the operators'
functions. In addition, since our model is
oriented towards the representation of pipe-lining,
another important property, namely the
conservation of tokens, should be considered in
conjunction with proper termination. More precisely,
stages in the pipe-line have to be reusable
after each activation. Therefore the
initial and terminal markings should differ only
by the presence of tokens on places which receive
or deliver tokens from or to other stages.
(The foremost stage as well as the last one constitute
the environment or outside world [8,9]).
The presence of token absorbers becomes very
useful for the realization of this constraint.


An important criterion to judge the formal 
power of some graph models is the determinacy 
condition [3,6]. A model is determinate if the 
sequence of values associated with each memory 
cell is unique. In our case, determinacy in- 
volves the analysis of the data graph. But, 
because of the token absorbers, one can already
see that determinacy cannot be achieved here.
Therefore our goal will only be to obtain I/O
determinacy, which is the property that,
for an initial set of input values,
all execution sequences yield an identical
set of output values. We shall not elaborate on
this point, since the scope of this paper is restricted
to the control structure of the model.
In a similar manner, we define the I/O equiva- 
lence of two control graphs associated with a 
common data graph as the property of the two 
graphs to be determinate and to yield identical 
output values for identical initial values. 


The programming of a large system should be 
modular. This structure has to be reflected in 
the model. Therefore, we need to be able to connect
subgraphs. As seen above, the property of
conservation of tokens allows the linkage of
stages in the pipe-line as shown in Figure 1.
Subroutine calling is modeled, in the control
graph, by application of an ALGOL-like copy rule
[1]. Although other techniques have been proposed
[4], none of them applies to reentrant subroutines,
the only kind with which one is concerned
while writing compilers for multiprocessors.


Finally, one objective of the modeling is 
to ascertain the amount of parallelism that one 
could achieve. The first element of parallelism 
is in the pipe-lining process. The second is in 
the potential concurrency within each stage. 
Therefore, one characteristic of a stage is its 
maximum parallelism, i.e. the maximum number of 
transitions which are ready to fire simultan- 
eously. For the example of Figure 5 this number 
is 2. 
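
The maximum parallelism of a stage, and the set of terminal markings it can reach, can be found by enumerating the reachable markings. The sketch below does this for the table-search example. As an assumption, the net structure in `NET` is reconstructed from the prose description of Figure 5, since the figure itself is not reproduced here; the function names are hypothetical.

```python
from collections import Counter, deque

# Assumed structure of the Figure 5 net, reconstructed from the text.
# transition: (input alternatives, output alternatives, absorbed places)
NET = {
    "s":  ([["S"]], [["A", "B"]], []),
    "a":  ([["A"]], [["C"], ["D"], ["E"]], []),
    "a'": ([["C"]], [["A"]], []),
    "b":  ([["B"]], [["F"], ["G"], ["H"]], []),
    "b'": ([["F"]], [["B"]], []),
    "c":  ([["D"]], [["I"]], ["B", "F", "G", "H"]),
    "d":  ([["G"]], [["J"]], ["A", "C", "D", "E"]),
    "e":  ([["I"], ["J"]], [["K"]], []),
    "f":  ([["E", "H"]], [["L"]], []),
}


def successors(marking):
    """Yield (transition, next marking) for every possible firing outcome."""
    m = Counter(marking)
    for name, (ins, outs, absorbed) in NET.items():
        for alt_in in ins:
            if all(m[p] > 0 for p in alt_in):
                for alt_out in outs:
                    n = Counter(m)
                    n.subtract(alt_in)
                    for p in absorbed:
                        n[p] = 0
                    n.update(alt_out)
                    yield name, tuple(sorted((+n).elements()))


def explore(initial):
    """Breadth-first search over markings: terminal markings and max parallelism."""
    seen, queue = {initial}, deque([initial])
    terminal, max_parallelism = set(), 0
    while queue:
        marking = queue.popleft()
        succ = list(successors(marking))
        max_parallelism = max(max_parallelism, len({name for name, _ in succ}))
        if not succ:
            terminal.add(marking)
        for _, nxt in succ:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return terminal, max_parallelism


terminal, max_parallelism = explore(("S",))
print(sorted(terminal), max_parallelism)
```

With this reconstruction, at most two transitions are ever ready to fire at once, and the only markings from which nothing can fire are a single token on K or on L.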


III. An Example of the Use of the Model. 


To illustrate our approach as well as the 
use of the model, we consider the following ex- 
ample: During the lexical analysis phase of the 
compilation, it is known that either an identi- 
fier or a reserved word is going to be scanned 
as soon as the first character of a lexical en- 
tity has been recognized as a letter. The finite- 
state automaton, translated in extended Petri Net 
form, "flow-charting" this simple algorithm is 
shown in Figure 6. 


Figure 6. Lexical Analysis: the Obvious Approach. 
The places have the following meanings:
1 : the first character is a letter 
2 : ready to scan next character 
3 : character is either a letter or a digit 
4 : separator (i.e. end of lexical entity) 
5 : lexical entity is an identifier 
6 : lexical entity is a key word. 


The transitions correspond to the following ac- 
tions: 


a,b : scan next character (example of the 
copy rule for reentrant subroutines) 


c : dummy procedure to prevent place 2 
from being both input and output to tran- 
sition a. 


d : look-up the table of reserved words. 


If during the scanning a digit is encoun- 
tered, the lexical entity cannot be a reserved 
word. Therefore, we introduce a new output place 
to transition b with the new meanings: 


3 : character is a letter 
7 : character is a digit. 


If place 7 becomes full, then the lexical entity 
cannot be a reserved word and transition d 
should never be activated. This is accomplished 
in the graph model by the introduction of switch 
9 which becomes full after transition e has 
fired and the latter fires every time a digit is 
recognized. When a separator is encountered and 
place 4 becomes full, transition f fires, taking
the f branch if a digit had been encountered,
the e branch otherwise. Only in this
latter case does place 10 become full and allow 
transition d, i.e. the reserved word search, to 
be activated. However, two defects are apparent 
in this graph (Figure 7): 


-Tokens are going to accumulate on switch 
9. When place 5 is reached, the number 

of tokens left on 9, if any, is the number 
of digits encountered minus one. Hence, 
we need to either introduce token absorb- 
ers or change the logic of the graph. 


-No parallelism is yet apparent. 


Figure 7. Introduction of a Switch. 


Figure 8 shows how this latter weakness is 
taken care of. As soon as place 1 has been 
reached, the search in the reserved word table 
could be initiated if the latter were ordered. 
For example, we could find pointers to the begin- 
ning and end of the subtable corresponding to the 
letter scanned in place 1. To that effect, tran- 
sition d is split into: 


d : find begin and end pointers 


and g : finish the search for the whole lexi- 
cal entity, 


with places 11 and 12 initiating these transi- 
tions. We could even refine further by allowing 
switch 9 to be an alternate output to transition 
d in case that there exists no reserved word
starting with the letter scanned in place 1.
However, our main point here is to show a poss- 
ible concurrency between the scanning process 
(transition b) and the search process (transi- 


tion d). 




Figure 8. Introduction of parallelism with 
accumulation of tokens. 
Finally, we have to "clean up" the graph
so that it terminates properly and shows conservation
of tokens (cf. Figure 9). In order to
prevent accumulation of tokens on switch 9, two
new places:

13 : the first digit is encountered
14 : a digit (not first) has been recognized

and a dummy transition h are placed between
place 7 and switch 9. Another switch, 15,
directs the output of h on either 13 or 14 and
is filled at the beginning of the execution of
the graph. i is a dummy transition between 14
and 8 introduced for the same reason as was c.
With this logic, switch 9 will receive at most
one token. If for a particular execution sequence
switch 9 remains empty, then switch 15
will be full when place 6 is reached. Hence a
token absorber is sent from transition g to
switch 15. Finally, the firing of transition e
removes any token present on either place 11 or
place 12 via a multiarc token absorber. Thus,
if a digit is encountered before the first table
searching, this latter computation is cancelled.


Figure 9. The final Graph for Lexical Analysis.


We have applied the same technique to
other detailed algorithms with success. The
"cleaning" of the graph is greatly facilitated
by the procedure used to check for proper termination
as presented in the next section.


IV. Formal Definitions and Properties.


1. Places, transitions and arcs.


The control graph is a triple P = (X,A,C)
where:

- X = P ∪ T is a finite set of vertices with
P = {p_1, p_2, ..., p_n} being a finite set of places;
T = {t_1, t_2, ..., t_m} being a finite set of transitions;
S, a possibly empty proper subset of P, is a set of switches.

- A = I ∪ O ∪ N is a finite set of arcs with
I = {(p_i, t_j) | p_i ∈ P, t_j ∈ T} being the
input arc set and p_i being an input place to t_j;
O = {(t_j, p_i) | t_j ∈ T, p_i ∈ P} being the
output arc set and p_i being an output place to t_j;
N = {(t_j; [p_1, p_2, ..., p_q]) | t_j ∈ T; p_1, p_2, ..., p_q ∈ P}
being the token absorber set and p_1, ..., p_q being the cancelled places.

- C is the control which associates with each
transition a pair of logics, i.e. one of the
possible combinations {(AND,AND), (AND,EOR),
(EOR,AND), (EOR,EOR)}.

The following topological restrictions are
imposed. If (p_i, t_j) ∈ I, then (t_j, p_i) ∉ O.
If p_i ∈ S and (p_i, t_j) ∈ I, then no other
switch is an input place to t_j, t_j is either
of (AND,EOR) or (EOR,EOR) logic, and there are
only two output places to t_j, the two output
arcs leading from t_j being labeled respectively
f and e.


2. Tokens, markings and firing expressions.


A place p_i is full if it holds at least
one token. Otherwise it is empty. The set P
and the number of tokens associated with each of
its elements constitute a marking. Equivalently
it can be represented by a multiset or bag [4].

The firing of a transition is controlled
by the presence of tokens on its input places as
well as by its logic. The latter also directs
the outcome of the firing. The possible
situations are summarized below by firing expressions
[4] for a transition a with the following
conventions:


p_{i_1}, p_{i_2}, ..., p_{i_n} : input places to transition a
p_{o_1}, p_{o_2}, ..., p_{o_m} : output places to transition a
p_{n_1}, p_{n_2}, ..., p_{n_q} : places cancelled by transition a,
    where \bar{p} shows the absence of tokens
p_s is a switch, and p_{o_e} and p_{o_f} are the output
    places to transition a in that case


-(AND,AND) logic:
    p_{i_1} p_{i_2} ... p_{i_n} → p_{o_1} p_{o_2} ... p_{o_m} \bar{p}_{n_1} \bar{p}_{n_2} ... \bar{p}_{n_q}

-(AND,EOR) logic:
    p_{i_1} p_{i_2} ... p_{i_n} → p_{o_j} \bar{p}_{n_1} \bar{p}_{n_2} ... \bar{p}_{n_q},   j = 1,2,...,m
  or if a switch is present
    p_{i_1} p_{i_2} ... p_{i_n} \bar{p}_s → p_{o_e} \bar{p}_{n_1} \bar{p}_{n_2} ... \bar{p}_{n_q}
    p_{i_1} p_{i_2} ... p_{i_n} p_s → p_{o_f} \bar{p}_{n_1} \bar{p}_{n_2} ... \bar{p}_{n_q}

-(EOR,EOR) logic:
    p_{i_j} → p_{o_k} \bar{p}_{n_1} \bar{p}_{n_2} ... \bar{p}_{n_q},   j = 1,2,...,n; k = 1,2,...,m
  or if a switch is present
    p_{i_j} \bar{p}_s → p_{o_e} \bar{p}_{n_1} \bar{p}_{n_2} ... \bar{p}_{n_q},   j = 1,2,...,n
    p_{i_j} p_s → p_{o_f} \bar{p}_{n_1} \bar{p}_{n_2} ... \bar{p}_{n_q},   j = 1,2,...,n

The firing expressions for the graph of Figure 9 
are shown in Figure 10. 


1 → 2 11 15              7 15 → 13
2 → 3                    7 \bar{15} → 14
2 → 4                    8 → 2
2 → 7                    10 12 → 5 \bar{15}
3 → 2                    10 12 → 6 \bar{15}
4 9 → 5                  11 → 12
4 \bar{9} → 10           13 → 8 9 \bar{11} \bar{12}
                         14 → 8


Figure 10. Firing Expressions for the 
Graph of Figure 9. 
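
The graph of Figure 9 can be exercised with a small simulation. This is a sketch under stated assumptions, not the author's procedure: the transitions below follow one reading of the firing expressions of Figure 10, with the absent-token marks restored from the description of Figure 9 in Section III; `KEYWORDS` is a hypothetical reserved-word table; a blank stands for the separator; and the enabled transition to fire is picked at random to imitate arbitrary interleaving of the scanning and search processes.

```python
import random
from collections import Counter

KEYWORDS = {"if", "do", "end"}  # hypothetical reserved-word table


def run(entity, seed=0):
    """Scan `entity` (first character a letter); return the terminal marking."""
    rng = random.Random(seed)
    text = entity + " "   # the blank plays the role of the separator
    pos = 1               # place 1: the first letter is already recognized
    m = Counter({1: 1})

    def move(takes, puts, cancels=()):
        m.subtract(takes)
        for p in cancels:           # token absorber
            m[p] = 0
        m.update(puts)

    while True:
        # Each non-switch place feeds exactly one transition in this reading.
        ready = [p for p in (1, 2, 3, 4, 7, 8, 11, 13, 14) if m[p] > 0]
        if m[10] > 0 and m[12] > 0:
            ready.append(10)
        if not ready:
            break
        p = rng.choice(ready)
        if p == 1:
            move([1], [2, 11, 15])
        elif p == 2:                # scan next character (EOR output)
            ch = text[pos]
            pos += 1
            move([2], [3 if ch.isalpha() else 7 if ch.isdigit() else 4])
        elif p in (3, 8):
            move([p], [2])
        elif p == 4:                # switch 9 selects identifier or search
            if m[9] > 0:
                move([4, 9], [5])
            else:
                move([4], [10])
        elif p == 7:                # switch 15 is full only for the first digit
            if m[15] > 0:
                move([7, 15], [13])
            else:
                move([7], [14])
        elif p == 13:
            move([13], [8, 9], cancels=(11, 12))
        elif p == 14:
            move([14], [8])
        elif p == 11:
            move([11], [12])
        else:                       # p == 10: finish the table search
            move([10, 12], [6 if entity in KEYWORDS else 5], cancels=(15,))
    return sorted((+m).elements())


print(run("if"), run("x1"), run("ab"))
```

In every interleaving tried, the terminal marking holds a single token, on place 5 (identifier) or place 6 (key word), with nothing left on the switches; this is the token conservation the clean-up of Figure 9 is meant to achieve.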


3. Execution Sequences and Proper Termination. 


A given marking indicates which transition(s),
if any, can fire. After firing of one
transition, a new marking is generated according
to one of the firing expressions of the fired
transition. Starting with an initial marking
M_0, the sequence of the transition firings (or
equivalently of the generated markings) is called
an execution sequence. A marking from which no
transition can fire is called a terminal marking.
For a given P and M_0, we are interested in the
finiteness of the execution sequences as well as
their terminal markings. Thus, we also define a
set of goal terminal markings ℳ. We consider
now graph executions as the triple (P, M_0, ℳ). By
definition, a graph execution has the property of
token conservation if it is properly terminating
(cf. below) and if, for every terminal marking
M ∈ ℳ, the set of full places is composed exclusively
of places which either belong also to
M_0 - with the same number of tokens - or for
which there is no transition admitting them as
input places.


Before defining the concept of proper term- 
ination, we need to introduce two other proper- 
ties of the graph, namely: 


-A control graph P is k-safe if places
can hold at most k tokens (a 1-safe graph
is simply called safe).


-A graph is repetition-free if the domain
of (data) operators associated with (AND,
EOR) and (EOR,EOR) logic transitions is
modified between two consecutive firings of
the transition [4,6].


Now, a k-safe, repetition-free graph execution
$(P, M_0, \mathcal{M})$ is properly terminating if,
for all interpretations and all execution sequences,
if a terminal marking is reached, then:


-No place will ever receive more than k
tokens;


-The terminal marking is a member of $\mathcal{M}$;


-All members of $\mathcal{M}$ can be reached from $M_0$
by some finite execution sequence.


Theorem: There exists an effective procedure
to determine if the execution $(P, M_0, \mathcal{M})$ of a
k-safe, repetition-free graph P is properly
terminating.


Proof: The proof is by construction and resembles
that of the word problem in automata theory.


Let |P| be the number of places in the
graph. If P is k-safe, the number of allowable
markings is bounded by $2^{k|P|}$. Consider now the
state graph consisting of the $2^{k|P|}$ states
(allowable markings) and of a dead state A
corresponding to a tentative firing of a transition
which would fill a place with more than k
tokens. By convention no state can be reached
from A. We construct the connections between
different states as follows. We start with $M_0$
and build the set $M_1$ as the set of states
which can be reached from $M_0$ by the firing
of one transition. We link $M_0$ with members of
$M_1$, each link (or arc) being labeled with the
name of the transition. We repeat this process
with each element of $M_1$, yielding $M'_2$ and the
labeled arcs between elements of $M_1$ and $M'_2$.
Then the set $M_2$ is defined by

$M_2 = M'_2 - (M_1 \cup \{M_0\})$.

At step i, i.e. upon reaching $M_{i-1}$, the
construction is as follows. Let $M'_i$ be the
set of markings which can be reached from an
element of $M_{i-1}$ by firing of a single
transition. We connect elements of $M_{i-1}$ with
their appropriate successors in $M'_i$ and
determine $M_i$ by

$M_i = M'_i - (M_{i-1} \cup M_{i-2} \cup \cdots \cup M_1 \cup \{M_0\})$.

Since the number of states is finite, this
procedure halts for some j, $j \le 2^{k|P|}$, such that
$M_j = \emptyset$. Now, let M' be the set of markings
belonging to an $M_i$, 0 < i < j, from which no
other marking can be reached. The graph is
properly terminating if and only if $M' = \mathcal{M}$ and
there exists a path from any state belonging to
some $M_i$ to at least one member of $\mathcal{M}$. This
latter condition is checked easily by some
"successor" algorithm. The necessary condition is
evident. If $M' \supset \mathcal{M}$, then there exists a
terminal marking which was not in the set of goals.
If $M' \subset \mathcal{M}$, then some goal can never be reached.
If some state reachable from $M_0$ cannot reach a
member of $\mathcal{M}$, then the execution sequence cannot
terminate. The sufficient condition stems from
the repetition-free property which states in
effect that every possible path constructed above
will be taken for some interpretation and
execution sequence.


Q.E.D.
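
The layered construction used in the proof can be sketched in a few lines of Python. This is only a minimal sketch under assumptions that are not in the paper: each firing expression is given as a triple (name, input places, output places), the alternatives of an EOR transition are listed as separate triples, and cancelled places and switches are ignored. The names `successors` and `properly_terminating`, and the small net in the usage lines, are hypothetical.

```python
from collections import Counter, deque

def freeze(counter):
    """Hashable form of a marking: sorted (place, token count) pairs."""
    return tuple(sorted((p, n) for p, n in counter.items() if n > 0))

def successors(marking, transitions, k):
    """Yield (name, next_marking); next_marking is None for the dead state."""
    m = Counter(dict(marking))
    for name, inputs, outputs in transitions:
        need = Counter(inputs)
        if all(m[p] >= n for p, n in need.items()):
            nxt = m - need + Counter(outputs)
            if any(n > k for n in nxt.values()):
                yield name, None          # a place would exceed k tokens
            else:
                yield name, freeze(nxt)

def properly_terminating(m0, goals, transitions, k=1):
    start = freeze(Counter(m0))
    goal_set = {freeze(Counter(g)) for g in goals}
    edges = {}                            # state -> list of successor states
    frontier = deque([start])
    while frontier:                       # breadth-first, layer by layer
        state = frontier.popleft()
        if state in edges:
            continue
        edges[state] = []
        for _name, nxt in successors(state, transitions, k):
            if nxt is None:
                return False              # dead state reached
            edges[state].append(nxt)
            if nxt not in edges:
                frontier.append(nxt)
    terminals = {s for s, out in edges.items() if not out}
    if terminals != goal_set:             # the condition M' = goal set
        return False
    # "successor" check: every reachable state must be able to reach a goal
    can_finish = set(goal_set)
    changed = True
    while changed:
        changed = False
        for s, out in edges.items():
            if s not in can_finish and any(t in can_finish for t in out):
                can_finish.add(s)
                changed = True
    return can_finish == set(edges)

# Hypothetical three-transition net: 1 -> 2 3, 2 -> 4, 3 -> 5
net = [("a", [1], [2, 3]), ("b", [2], [4]), ("c", [3], [5])]
print(properly_terminating([1], [[4, 5]], net, k=1))
```

The sketch enumerates states explicitly, so its cost grows with the number of reachable markings, as the paper notes for the original procedure.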
Figure 11 shows the state diagram and connections
for the execution (P,1,{5,6}) of the
safe graph P of Figure 9. States belonging to
M' have been circled.


Figure 11. Procedure for Proper Termination.


It is worthwhile to remark that the above 
procedure allows also: 


-The determination of the value k for 
k-safety; k is the maximum number of 
repetitions of a place in any marking. 


-The determination of the number of transitions
which can fire simultaneously, i.e.
the maximum parallelism. We explain the
process informally here by the example of
Figure 11.


We write the execution sequences leading
from $M_0$ to the other reachable states as
sequences of transition firings. We only consider
paths between states which lead from a state in
some set $M_i$ to a state in some other set
$M_j$, j > i. From a given marking, a boolean
expression (a sum of products) indicates the possible
connections. When a product is present, it
shows possible concurrency in the firing of two
(or more) transitions. An execution sequence is
made up of concatenations of such expressions.
From the example of Figure 11, we have the
development:


$E_0 = a$   (unique transition firing possible)

$E_1 = ab \cup ad \cup a(b \cap d)$   (either b or d or both can fire from (2 11 15))


The concurrency (here $b \cap d$) is recognized when
the firing expressions for two transitions have
mutually exclusive left hand sides and these left
hand sides are subsets of the same marking. Now
from $E_1$, we obtain $E_2$ by considering the
transitions out from each state of $M_2$ and we
expand appropriately the terms by recognizing
which term (i.e. path) led to each marking. For
example, (3 11 15) has been reached from ab
and also from $a(b \cap d)$.

$E_2 = ab(c \cup d \cup (c \cap d)) \cup ab(d \cup f \cup (d \cap f)) \cup ab(d \cup h \cup (d \cap h)) \cup adb$
$\quad \cup\, a(b(c \cup d \cup (c \cap d)) \cap d) \cup a(b(d \cup f \cup (d \cap f)) \cap d)$
$\quad \cup\, a(b(d \cup h \cup (d \cap h)) \cap d) \cup a(b \cap db)$

We consider next the union operators as distributive
since they correspond to distinct paths.
Thus, we expand $E_2$ while at the same time
suppressing from it those expressions which
correspond to paths leading uniquely to markings in
$M_1$ and $M_2$. It yields (crossed terms are shown
in square brackets):

$E_2 = [abc] \cup abd \cup [ab(c \cap d)] \cup abf \cup ab(d \cap f)$
$\quad \cup\, abh \cup ab(d \cap h) \cup [adb] \cup [a(bc \cap d)]$
$\quad \cup\, [a(bd \cap d)] \cup [a(b(c \cap d) \cap d)] \cup a(bf \cap d)$
$\quad \cup\, [a(b(d \cap f) \cap d)] \cup a(bh \cap d) \cup [a(b(d \cap h) \cap d)]$
$\quad \cup\, [a(b \cap db)]$


The terms which are crossed are cancelled for the
following reasons:


- abc, $a(bc \cap d)$ and $ab(c \cap d)$ because
they lead to markings belonging to $M_1$
and $M_2$, or to the same marking as
abd.

- adb because it leads to the same marking
as abd.


- Terms of the form $\alpha(\beta x \cap \gamma x)$, where
$\alpha$, $\beta$ and $\gamma$ are subsequences, are
expanded into $\alpha(\beta \cap \gamma x) \cup \alpha(\beta x \cap \gamma)$ since
the firing of transition x cannot be
duplicated. For example, $a(b(d \cap f) \cap d)$
becomes $a(bf \cap d) \cup ab(d \cap f)$ and these
last two terms are already present in $E_2$.

Hence, we obtain:

$E_2 = abd \cup abf \cup abh \cup ab(d \cap f) \cup ab(d \cap h) \cup a(bf \cap d) \cup a(bh \cap d)$
Continuing this process, we will have:

$E_3 = abdf \cup abdh \cup abhe \cup abh(d \cap e) \cup ab(d \cap he)$

$E_4 = abdfg \cup abhec \cup abh(d \cap ec) \cup ab(d \cap hec)$

$E_5 = abhecb \cup abh(d \cap ecb) \cup ab(d \cap hecb)$

$E_6 = abhecbh \cup abh(d \cap ecbh) \cup ab(d \cap hecbh)$

Now, the maximum parallelism MP is the maximum
number of elements that are linked by an $\cap$
sign in any given term belonging to an $E_i$. In
this example, MP = 2.
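
The concurrency test described above (left-hand sides that are mutually exclusive and all contained in the current marking) can be illustrated with a small brute-force sketch. The mapping of transition names to input places below is an assumption made for illustration (a fires from place 1, b from place 2, d from place 11); the function name `max_concurrency` is hypothetical.

```python
from itertools import combinations

def max_concurrency(marking, inputs):
    """Largest group of enabled transitions with pairwise disjoint inputs.

    marking: set of full places; inputs: dict name -> set of input places.
    Returns (size, group). Brute force, intended for small examples only.
    """
    enabled = [t for t, need in inputs.items() if need <= marking]
    for size in range(len(enabled), 0, -1):
        for group in combinations(enabled, size):
            used = set()
            disjoint = True
            for t in group:
                if used & inputs[t]:      # two left-hand sides overlap
                    disjoint = False
                    break
                used |= inputs[t]
            if disjoint:
                return size, group
    return 0, ()

# Assumed input places for three of the transitions
inputs = {"a": {1}, "b": {2}, "d": {11}}
print(max_concurrency({2, 11, 15}, inputs))
```

Taking the maximum of this count over all reachable markings gives one way to estimate the maximum parallelism without expanding the full expressions.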


4. The Reduction Procedure. 


The number of steps in the above procedure 
grows exponentially with the number of places in 
the graph. In a recent paper [4], the U. C. L. A. 
group has shown how this procedure could be shor- 
tened for a certain class of graphs of which our 
graph without token absorbers and switches is a 
subclass. This reduction procedure consists of a 
selective substitution of markings appearing on 
the lefthand side of the firing expression by the 
corresponding righthand sides. It can be shown 
that only a slight modification to the process is 
necessary to apply equally to the graphs we have 
defined above. The term reduction is used since
the number of sets $M_i$, as well as their
cardinalities, is diminished through the activation of
the procedure. A few steps of the process applied
to the graph of Figure 9 are shown in Figure 12(a)
as well as the resultant state graph.


Reducing 2
1 → 3 11 15       4 9 → 5          8 → 7
1 → 4 11 15       4 9 → 10         10 12 → 5 15
1 → 7 11 15       7 15 → 13        10 12 → 6 15
3 → 3             7 15 → 14        11 → 12
3 → 4             8 → 3            13 → 8 9 11 12
3 → 7             8 → 4            14 → 8

After reducing 3, 8, 11 and 14
1 → 4 12 15       7 15 → 13        10 12 → 6 15
1 → 7 12 15       7 15 → 4         13 → 4 9 12
4 9 → 5           7 15 → 7         13 → 7 9 12
4 9 → 10          10 12 → 5 15

Final reduction
1 → 4 12 15       7 15 → 13        10 12 → 5 15
1 → 7 12 15       7 15 → 7 9 12    10 12 → 6 15
4 9 → 5           7 15 → 4
4 9 → 10          7 15 → 7


Figure 12(a).
Figure 12. The Reduction Procedure and Terminal
State Graph.
V. Conclusion. 

In this paper we have presented a methodol- 
ogy for modeling parallel computations by graph 
models. We have shown what features are partic- 
ularly appropriate for a specific application, 
namely compilation. Descriptive aspects (e.g. 
switches), efficiency aspects (e.g. token ab- 
sorbers) and formal aspects (e.g. proper termina- 
tion) were considered. This work is still in its 
early stages, and it might be necessary to intro- 
duce new features in the model. This will be done 
following the philosophy that we have put forward 
in this paper; that is, adjunctions to enhance 
the descriptive power of the model should not be 
made at the expense of destroying some formal 
properties, and, conversely, formal properties 
should not be sought if they do not relate to the 
application at hand. 
Abstract -- This paper reports the results
of a measurement of parallelism at the statement
level in 86 FORTRAN programs. The amount of
parallelism is determined by an analyzer program and
is measured in terms of speedup over serial
execution, the number of independent processors
required, the efficiency of parallel execution and
other measures.

The analysis techniques are only sketched in
this paper; details may be found in the
references. We also outline some machine organization
assumptions.


Introduction 


In the folklore of computer architecture 
there has been much speculation about the effec- 
tiveness of various machines in performing vari- 
ous computations. While it is quite easy to 
design a machine (or part of a machine) and study 
its effectiveness on this algorithm or that, it 
is rather difficult to make general effectiveness 
statements about classes of algorithms and ma- 
chines. We are attempting to move in this direc- 
tion and the present paper contains experimental 
measurements of a rather wide class of algorithms. 
Such measurements should be quite helpful in 
establishing some parameters of machine organiza- 
tion. 

The organization of algorithms and pro- 
gramming for multioperation machines has been 
attacked in a great variety of ways in the past. 
These have included new programming languages, 
new numerical methods, and a variety of schemes 
to analyze programs to exploit some particular 
kind of simultaneous processing. The latter have 
included both hardware and software devices [5,
15,27,32,33,35,37]. Multiprogramming often 

formed an important part of these studies. None 
of them apparently tried to extract from one pro- 
gram as many operations as possible which could be 
executed simultaneously. For a comprehensive sur- 
vey of many related results see [3]. 

This paper contains little detail about ma- 
chine organization--we merely sketch some gross 
simplifying assumptions below. Then we outline 
the organization of our program analyzer and dis- 
cuss its improvements over an earlier version 
[23]. A set of 86 FORTRAN decks totalling over 
4000 cards has been analyzed and these are de- 
scribed in general terms. Then we present a number
of tables and graphs which summarize our ex-
periments. These include the possible speedup 
and number of processors required for the programs 
analyzed. Finally, we give some interpretations 
of these results. We conclude that some of the 
folklore has been in error, at least with respect 
to the kinds of programs we have measured. 


Goals, Assumptions and Definitions 


We are attempting to determine for computa- 
tional algorithms, a set of parameters and their 
values, which would be useful in computer system 
design. A direct way of doing this is by the 
analysis of a large set of existing programs. We 
have chosen to analyze FORTRAN programs because 
of their wide availability and because their 
analysis is about as difficult as any high level 
language would be. A language with explicit ar- 
ray operations, for example, would be easier to 
analyze but would restrict our analysis domain to 
array type algorithms. We are attempting to show 
that a very wide class of algorithms can be found 
to possess a good deal of parallelism. The pro- 
grams we have analyzed in many cases have no DO 
loops at all, for example, and most decks have 
less than 40 cards. 

The experiments reported here are a substan- 
tial improvement over those reported in Kuck, et 
al [23] for several reasons. First, we have ana- 
lyzed more than four times as many programs. 
These have been drawn from a wide variety of 
sources as described below and represent a wide 
variety of applications including a number of non- 
numerically oriented ones. Second, in an attempt 
to study the sensitivity of our analyses to mem- 
ory assumptions we have made two sets of runs as 
described later (see Table III). Third, we have 
made several improvements to the analyzer itself. 
These include a new method of handling DO loops 
which we call the vertical scheme, and a new way 
of treating IF statements within DO loops. These
are discussed later in this paper. 

In order to interpret the results of our 
analysis, we must make a number of assumptions 
about machine organization. These cannot be dis- 
cussed in any detail here, but most of them are 
backed by detailed study as given in our refer- 
ences. Some are of course idealizations which we 
would not expect to hold in a real machine. Thus,
the results would be degraded to some extent. On 
the other hand, since our analyzer still is quite 
crude in several respects, we might expect these 
degradations to be offset by better speedups due 
to an improved analyzer. 

We ignore I/O operations, assuming that they
do not exist in FORTRAN. We also ignore control
unit timing, assuming that instructions are
always available for execution as required and are
never held up by a control unit. We assume the
availability of an arbitrary number of processors,
all of which are capable of executing any
of the four arithmetic operations (but not
necessarily all the same one) at any time. Each of
the arithmetic operations is assumed to take the
same amount of time, which we call unit time.

Two nonstandard kinds of processing are as- 
sumed. To evaluate the supplied FORTRAN func- 
tions we rely on a fast scheme proposed in 
De Lugish [13]. This allows SIN(X), LOG(X), etc., 
to be evaluated in no more than a few multiply 
times. We also assume a many-way jump processor. 
Given predicate values corresponding to a tree of 
IF statements, this processor determines in unit 
time which program statement is the successor to 
the statement at the top of the tree. With up to 
8 levels in such a tree, the gate count for the 
logic is modest [11,12]. 

We assume the existence of an instanta- 
neously operating alignment network which serves 
to transmit data from memory to memory, from 
processor to processor, and between memories and 
processors. Based on studies of the requirements 
of real programs, some relatively inexpensive 
alignment networks have been designed [22,25]. 

We assume the memory can be cycled in unit time 
and that there are never any accessing conflicts 
in the memory. In Lawrie [25], and Budnik and 
Kuck [7], memories are shown that allow the ac- 
cessing of most common array partitions without 
conflict. Hence, we believe that for a properly 
designed system, accessing and alignment con- 
flicts can be a minor concern and that under 
conditions of steady state data flow, good system 
performance could be expected. For more discus- 
sion see [21]. 


Let the parallel computation time $T_p$ be the
time measured in unit times required to perform
some calculation using p independent processors.
We define the speedup over a uniprocessor as
$S_p = T_1/T_p$, where $T_1$ is the serial computation
time, and we define efficiency as $E_p = T_1/(p T_p)$, which
may be regarded as the quotient of $S_p$ and the
maximum possible speedup p. As explained in
Kuck, et al [23], computation time may be saved
with the sacrifice of performing extra operations.
For example, a(b+cde) requires four operations
and $T_p = 4$, whereas ab + acde requires five
operations and $T_p = 3$. If $O_p$ is the number of
operations executed in performing some computation
using p processors, then we call $R_p$ the
operation redundancy and let $R_p = O_p/O_1 \ge 1$, where
$O_1 = T_1$. Note that our definition of efficiency
$E_p$ is quite conservative since utilization of
processors by redundant operations does not
improve efficiency. Utilization is defined as
$U_p = O_p/(p T_p)$, where $p T_p$ is the number of operations
which could have been performed. Using $R_p$ we
can rewrite $U_p$ as $U_p = R_p O_1/(p T_p) = R_p T_1/(p T_p)$, and by the
definition of $E_p$ we have $U_p = E_p R_p$. Thus, if an
observer notices that all p processors are
computing all of the time he may correctly conclude
that the utilization is 1, but he may not
conclude that the efficiency is 1 since the
redundancy may be greater than 1.
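
The definitions above can be checked numerically on the example expression, where the serial form a(b+cde) takes four operations and the distributed form ab + acde takes five operations in three steps. The sketch below assumes p = 2 processors for the distributed form (a value not stated in the text, chosen because two processors are enough to finish in three steps); the function name `metrics` is hypothetical.

```python
from fractions import Fraction

def metrics(t1, tp, op, p):
    """Speedup, efficiency, redundancy and utilization, with O_1 = T_1."""
    o1 = t1
    return {
        "S": Fraction(t1, tp),        # S_p = T_1 / T_p
        "E": Fraction(t1, p * tp),    # E_p = T_1 / (p T_p)
        "R": Fraction(op, o1),        # R_p = O_p / O_1
        "U": Fraction(op, p * tp),    # U_p = O_p / (p T_p)
    }

# ab + acde on an assumed 2-processor machine
m = metrics(t1=4, tp=3, op=5, p=2)
for name, value in m.items():
    print(name, value)
```

In this instance the utilization (5/6) exceeds the efficiency (2/3) by exactly the redundancy factor 5/4, which is the relation $U_p = E_p R_p$.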


Analysis Techniques 


The analyzer accepts a FORTRAN program as input
and breaks it into blocks of assignment statements,
DO loop blocks, and IF tree blocks. During
analysis each block is analyzed independently
of the others and $T_1$, $T_p$, p, and $O_p$ are found for
each block. Next, we find all traces through the
program according to the IF and GO TO statements.
We accumulate $T_1$, $T_p$ and $O_p$ for each block in
each trace to give $T_1$, $T_p$ and $O_p$ for the trace. The
maximum p found in any block in each trace becomes p.
$R_p$, $E_p$, $S_p$ and $U_p$ are calculated for each trace.
A block of assignment statements (BAS) is a
sequence of assignment statements with no
intervening statements of any kind. Statements in a
BAS can be made independent of each other by a
process called forward substitution. For example,
A = B + C; R = A + D by forward substitution
becomes A = B + C; R = B + C + D. By using the
laws of associativity, commutativity, and
distributivity as in Muraoka [30], and Han [16], we
find the parallel parse tree for each statement.
The algorithm of Hu [17] is applied to this
forest of trees to give p. $T_p$ is the maximum
height of the individual trees and $O_p$ is the sum
of the operations in the forest. This collection
of techniques is called tree-height reduction.
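
Forward substitution and the resulting tree heights can be sketched for the simplest case. The sketch below is an illustration only, assuming every statement is a pure sum of named operands; it does not implement the distributivity transformations or Hu's algorithm, and the function names are hypothetical.

```python
def forward_substitute(statements):
    """Replace earlier left-hand sides by their definitions.

    statements: list of (lhs, [operand names]); each right-hand side is
    assumed to be a sum of its operands.
    """
    defined = {}
    result = []
    for lhs, operands in statements:
        expanded = []
        for name in operands:
            expanded.extend(defined.get(name, [name]))
        defined[lhs] = expanded
        result.append((lhs, expanded))
    return result

def tree_height(n_operands):
    """Height of a balanced binary addition tree: ceil(log2(n))."""
    return (n_operands - 1).bit_length()

bas = [("A", ["B", "C"]), ("R", ["A", "D"])]
flat = forward_substitute(bas)             # R becomes B + C + D
t_p = max(tree_height(len(ops)) for _, ops in flat)
o_p = sum(len(ops) - 1 for _, ops in flat)
print(flat, t_p, o_p)
```

For this two-statement block the parallel time equals the serial time (2 steps), while the operation count rises from 2 to 3; the benefit of the technique shows up on longer dependence chains.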

An IF tree block is a section of a FORTRAN 
program where the ratio of IF statements to as- 
signment statements is larger than some pre- 
determined threshold. An IF tree block is 




transformed into (1) a BAS consisting of every 
set of assignment statements associated with each 
path through the decision tree, (2) a BAS con- 
sisting of the relational expressions of the IF 
statements which have been converted to assign- 
ment statements (i.e., X > Y is converted to 
B = SIGN(X-Y)), and (3) a decision tree into 
which the IF statements are mapped. The tree- 
height reduction algorithm is then applied to (1) 
and (2) combined. Davis [11] shows how to evalu- 
ate an eight-level decision tree in unit time. 
Thus a dual purpose is served: speedup is in- 
creased by increasing the size of the BAS through 
combination of the smaller BAS's between IF state- 
ments, and a number of decision points in a pro- 
gram are reduced to a single multiple decision 
point which can be evaluated in parallel. The 
complete IF tree algorithm is described in Davis 
[11,12]. 

There are two types of parallelism in DO
loop blocks which can be found most often in
programs. First, the statement


      DO 1 I = 1, 3
    1 A(I) = A(I+1) + B(I) + C(I) * D(I)


can be executed on a parallel machine in such a
way that three statements, A(1) = A(2) + B(1)
+ C(1) * D(1), A(2) = A(3) + B(2) + C(2) * D(2)
and A(3) = A(4) + B(3) + C(3) * D(3) are computed
simultaneously by 3 different processors. Thus,
we reduce the computation time from $T_1 = 9$ to
$T_p = 3$. This type of parallelism (array
operations) we will call Type-1 parallelism. If we
apply tree-height reduction algorithms to each of
these three statements, we can further reduce the
computation time to 2 for a 6 processor machine.

The second type of parallelism lies in
statements such as


(i)   DO 1 I = 1, 5
    1 P = P + A(I)


(ii)  DO 1 I = 1, 5
    1 A(I) = A(I-1) + B(I)


which both have a recurrence relation between the
output and input variables. In example (ii), if
we repeatedly substitute the left-hand side into
the right-hand side and apply the tree-height
reduction algorithms to each resultant statement, we
can execute all 5 statements in parallel, e.g.,
A(1) = A(0) + B(1), A(2) = A(0) + B(1) + B(2),
..., A(5) = A(0) + B(1) + B(2) + B(3) + B(4)
+ B(5). This will decrease the computation time
from 5 to 3. For a single variable recurrence
relation as in example (i), we can use the same
techniques and compute only the last output P
= P + A(1) + A(2) + ... + A(5) in 3 unit steps
instead of 5. We will call this type of
parallelism Type-0 parallelism.
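
The step counts quoted for the recurrence in example (ii) can be reproduced with a short sketch. It assumes that each expanded statement is evaluated as a balanced addition tree, so a sum of n operands takes ceil(log2 n) unit steps; the function names and the sample data are hypothetical.

```python
def serial(a0, b):
    """Evaluate A(I) = A(I-1) + B(I) one iteration at a time."""
    a = [a0]
    for i in range(1, len(b) + 1):
        a.append(a[i - 1] + b[i - 1])
    return a

def expanded(a0, b):
    """A(i) = A(0) + B(1) + ... + B(i); the entries no longer depend
    on one another and could be evaluated in parallel."""
    return [a0 + sum(b[:i]) for i in range(len(b) + 1)]

def steps(i):
    """Height of a balanced tree over the i + 1 operands of A(i).

    i.bit_length() equals ceil(log2(i + 1)) for i >= 0.
    """
    return i.bit_length()

b = [1, 2, 3, 4, 5]
print(serial(10, b) == expanded(10, b))
print(max(steps(i) for i in range(1, 6)))   # parallel time for A(1)..A(5)
```

The longest statement, A(5), has six operands and therefore needs three steps, compared with five steps for the serial loop.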

In order to exploit these parallelisms in DO
loops, an algorithm described in Kuck et al [23],
called the horizontal scheme, can be used to
transform the original loop into an equivalent set of
small loops in which these potential parallelisms
will be more obvious. A modification of that
algorithm called the vertical scheme has now been
implemented. We illustrate these schemes with
the following example:

      DO S6 I = 1, 3, 1
   S1 T(I) = G(I) + M
   S2 G(I) = T(I) + D(I)
   S3 E(I) = F(I-1) + B(I)
   S4 F(I) = E(I) + G(I)
   S5 H(I) = A(I-1) + H(I-1)
   S6 A(I) = C(I) + N


Due to limited space, we are unable to
describe the details of the implementation [20,23],
and we only give the essential parts of the
vertical scheme:

a) Find the dependence graph among
statements (Figure 1). In the dependence graph each
node represents a statement, and a path from
$S_i$ to $S_j$ indicates that an input variable of $S_j$
during certain iterations has been updated by
$S_i$ during the same or an earlier iteration,
according to the original execution order.

b) Separate the dependence graph into
completely disconnected subgraphs, and arrange each
subgraph as a DO loop in parallel as shown in
Figure 2(a).

c) Apply the forward substitution technique
to each subloop and the tree-height reduction
algorithms to all resultant statements.

After this, the statements can be computed in
parallel. The required p and $T_p$ for each subloop
resulting from use of the vertical and horizontal
schemes are shown in Figure 2.
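
Steps (a) and (b) can be sketched for the six-statement example loop (S1 to S6). The sketch is a simplification under a stated assumption: subscripts are ignored, so a statement is taken to depend on another whenever it reads an array the other writes. The encoding of the loop and the function names are hypothetical.

```python
# (label, array written, arrays read); scalars M and N are omitted
loop = [
    ("S1", "T", ["G"]),
    ("S2", "G", ["T", "D"]),
    ("S3", "E", ["F", "B"]),
    ("S4", "F", ["E", "G"]),
    ("S5", "H", ["A", "H"]),
    ("S6", "A", ["C"]),
]

def dependence_edges(stmts):
    """Edge (Si, Sj) when Sj reads an array that Si writes."""
    writer = {written: label for label, written, _ in stmts}
    return [(writer[r], label)
            for label, _, reads in stmts
            for r in reads if r in writer]

def components(stmts):
    """Completely disconnected subgraphs, found with union-find."""
    parent = {label: label for label, _, _ in stmts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in dependence_edges(stmts):
        parent[find(u)] = find(v)
    groups = {}
    for label in parent:
        groups.setdefault(find(label), []).append(label)
    return sorted(groups.values())

print(components(loop))
```

Under this simplification the loop splits into two independent subloops, one containing S1 to S4 and one containing S5 and S6, each of which can then be treated by step (c).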

For this example, both schemes give us a nice
speedup: $T_1 = 18$, $T_p = 6$ for the horizontal
scheme and $T_p = 4$ for the vertical scheme. The
latter has a better speedup but uses more
processors. Note also that the total number of
processors listed in Figure 2, which is 12 for the
horizontal scheme and 32 for the vertical scheme,
can be further reduced by Hu's algorithm [16,17]
without increasing the number of steps, provided
that some of the subtrees formed by the resultant
statements are not completely filled, which is
usually the case in most programs.

The basic difference between these two
schemes is that the horizontal scheme tends to
facilitate the extraction of Type-1 parallelism
while the vertical scheme helps to find Type-0
parallelism. At present, we do not have a general
method of determining a priori which scheme will
give a better result for a particular DO loop.
Although many cases yield the same result using 
either scheme, in some cases a higher speedup 
(with or without lower efficiency of the use of 
processors) can be achieved using one scheme or 
the other. 

IF and GO TO statements increase the number
of possible paths through a DO loop and complicate
the finding of the dependence graph, when there
are more than a few IF and GO TO statements. We
find all possible paths and then analyze each
path separately and call this strategy DO path.
Thus, when p, $T_p$, etc., are being calculated for
an entire program we treat each path through each
DO loop separately rather than combining the
numbers for each DO loop path into one set of
numbers that describe the DO loop as a whole as
was done in Kuck, et al [23].


Description of Analyzed Programs 


A total of 86 FORTRAN programs with a total 
of 4115 statements were collected from various 
sources for this set of experiments. They have 
been divided into 7 classes: JAN, GPSS, DYS,
NUME, TIME, EIS, and MISC. JAN is a subset of 
the programs described in Kuck, et al [23], and 
came from Conte [10], IBM [18], Lyness [26], and 
the University of Illinois subroutine library. 
GPSS contains the FORTRAN equivalents of the 
GPSS (General Purpose Simulation System) as- 
sembler listings [11] of 22 commonly used blocks. 
The DYSTAL (Dynamic Storage Allocation Language 
in FORTRAN [34]) library provided the programs in 
DYS. NUME contains standard numerical analysis 
programs from Astill, et al [2], Carnahan [8], 
and other sources. TIME is several time series 
programs from Simpson [36]. EIS is several pro- 
grams from EISPACK (Eigensystem Package) which 
are FORTRAN versions of the eigenvalue programs 
in Wilkinson and Reinsch [38]. Waste paper
baskets provided elementary Computer Science student
programs, civil engineering programs, and Boolean 
synthesis programs. These and programs from 
Kunzi, et al [24], and Nakagawa and Lai [31] make 
up MISC. Table I and Figures 3 and 4 describe 
the 86 programs analyzed. 


Results 


The analyzer determines values of $T_1$, $T_p$, p,
$E_p$, $S_p$, $O_p$, $R_p$ and $U_p$ for each trace in a
program. Each program was analyzed separately using
both the horizontal and vertical schemes of DO
loop analysis. The results of vertical or
horizontal analysis were then used depending on which
scheme gave better results for a particular
program. The values of $T_1$, $T_p$, etc., for each trace
were then averaged to determine an overall value
for a program, $\bar{T}_1$, $\bar{T}_p$, etc. Thus, we assume that
each trace is equally likely, an assumption
required by the absence of any dynamic program
information. We feel this assumption yields
conservative values since the more likely traces,
which are probably large and contain more
parallelism, are given equal weight with shorter,
special case traces. Figures 5-9 are histograms
showing $\bar{T}_1$, $\bar{T}_p$, $\bar{E}_p$, $\bar{S}_p$, $\bar{U}_p$ respectively, versus
the number of programs.

The overall program values $\bar{T}_1$, $\bar{T}_p$, etc., are
averaged to obtain ensemble values $\tilde{T}_1$, $\tilde{T}_p$, etc.,
for groups of related programs (see Table I).


Table II shows these ensemble values for each
group of programs as well as for all programs
combined. As we can see, for a collection of
ordinary programs we can expect speedups on the
order of 10 using an average of 37 processors
with an average efficiency of 35%. The use of
averages in these circumstances is open to some
criticism but we feel it is acceptable in view of
the facts that the data are well distributed and
the final averages are reasonably consistent,
e.g., $\tilde{p}\tilde{E}_p = \tilde{S}_p$. Such anomalies as
$\tilde{T}_1/\tilde{T}_p > \tilde{S}_p$ can
be attributed to occasional large $T_1$ values in
our raw data.

At this time we should stress several points 
about our source programs. First, four programs 
were discarded because they contained nonlinear 
recurrence relations and caused analysis diffi- 
culties. Their inclusion would have perturbed 
the results in a minor way, e.g., speedup would 
be low for these four. One was discarded
because $T_1$ was so large that it affected the final
averages too strongly ($T_1$ = 10953). Second, all


the programs were quite small (see Table I). 
Third, the number of loop iterations was 10 or 
less for all but one of the programs (where it 
was 20) whose data is shown in Table II. Higher 
speedups, efficiencies, etc., would be expected 
using a more realistic number of iterations (see 
Figures 10-12). Finally, we have not employed 
any multiprogramming, i.e., we do not account for 
the fact that more than one program can be exe- 
cuted simultaneously (cf. [11]). Multiprogramming
would of course allow the use of more processors,
in general.

For the results shown in Figures 5-12 and 
Table II, the analyzer accounts for memory stores 
but not for any memory fetches. The effect of 
accounting for fetches is shown in Table III, 
which lists the ensemble values for 65 programs 
run with and without memory fetches. As we can 
see, accounting for memory fetches improves our 
results. In reality, a lookahead control unit 
and overlapped processing and memory cycling would 
perhaps result in numbers somewhere between these 
values. 


Finally, Figures 10-12 show 5 versus T,, P 


versus T,, and . versus p, respectively, for each 


ensemble JAN, GPSS, etc., as well as for all pro- 
grams. Additionally, we took the programs in JAN, 
GPSS, NUME, TIME, and EIS, which had DO loops 
with a variable limit (about 40% of the programs), 
and set the DO loop limits to 10. The resulting 
program values were averaged with all other pro- 
grams in these groups and the final average plot- 
ted in Figures 10-12. The analyses were repeated 
using DO limits of 20, 30, and 40, and the re- 
sulting averages plotted as before. 


Conclusions 
Our experiments lead us to conclude that 


multioperation machines could be quite effective 
in most ordinary FORTRAN computations. Figure 12 
shows that even the simplest sets of programs
(GPSS, for example, has almost no DO loops) could 
be effectively executed using 16 processors. The 
overall average (ALL in Figure 12) as shown in 
Table III is 35 processors when all DO loop 
limits are set to 10 or less. As the programs 
become more complex, 128 or more processors would 
be effective in executing our programs. Note 
that for all of our studies, $T_1$ < 10,000, so most


of the programs would be classed as short jobs in 
a typical computer center. In all cases, the 
average efficiency for each group of programs was 
no less than 30%. While we have not analyzed any 
decks with more than 100 cards, we would expect 
extrapolations of our results to hold. In fact, 
we obtained some decks by breaking larger ones at 
convenient points. 

These numbers should be contrasted with current computer
organizations. Presently, general purpose machines performing two to
four simultaneous operations are quite common. Pipeline, parallel and
associative machines which perform 8 to 64 simultaneous operations are
emerging, but these are largely intended for special purpose use.
Thus, we feel that our numbers indicate the possibility of perhaps an
order of magnitude speedup over the current situation. Next we
contrast our numbers with two commonly held beliefs about machine
organization.

Let us assume that for $0 \le \beta_k \le 1$, a fraction
$(1-\beta_k)$ of the serial execution time of a given program uses $p$
processors, while $\beta_k$ of it must be performed on $k < p$
processors. Then we may write (assuming $O_1 = O_p$)

$$T_p = \frac{\beta_k T_1}{k} + \frac{(1-\beta_k)\,T_1}{p}$$

and

$$E_p = \frac{T_1}{p\,T_p} = \frac{1}{1 + \beta_k\left(\frac{p}{k} - 1\right)}.$$

For example, if $k = 1$, $p = 33$, and $\beta_1 = \frac{1}{16}$, then
we have $E_{33} = \frac{1}{3}$. This means that to achieve
$E_{33} = \frac{1}{3}$, 15/16 of $T_1$ must be executed using all 33
processors, while only 1/16 of $T_1$ may use a single processor. While
$E = 1/3$ is typical of our results (see Figure 7), it would be
extremely surprising to learn that 15/16 of $T_1$ could be executed
using fully 33 processors. This kind of
observation led Amdahl [1] and others [9,35] to 
conclude that computers capable of executing a 
large number of simultaneous operations would not 
be reasonably efficient, or to paraphrase them 
"Ordinary programs have too much serial code to 
be executed efficiently on a multioperation 
processor". 
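
As a quick numerical check of the two-level efficiency argument above, the following sketch evaluates E_p = 1 / (1 + beta_k (p/k - 1)), which follows from T_p = beta_k T_1 / k + (1 - beta_k) T_1 / p when the parallel version performs no extra operations. The function and parameter names are illustrative and do not come from the paper.

```python
from fractions import Fraction


def two_level_efficiency(p, k, beta_k):
    """E_p when a fraction beta_k of T_1 runs on k processors and the rest on p.

    Assumes no redundant operations, so
    T_p = beta_k*T_1/k + (1 - beta_k)*T_1/p and E_p = T_1 / (p * T_p).
    """
    beta_k = Fraction(beta_k)
    return 1 / (1 + beta_k * (Fraction(p, k) - 1))


# The example in the text: k = 1, p = 33, beta_1 = 1/16.
e33 = two_level_efficiency(33, 1, Fraction(1, 16))
print(e33)  # 1/3
```

Raising k while holding p and beta_k fixed raises the efficiency, which is the point made in the next paragraph about k taking many values.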

Such arguments have an invalidating flaw, 
however, in that they assume k = 1 in the above 
efficiency expression. Evidently, no one who re- 
peated this argument ever considered the obvious 
fact that k will generally assume many integer 
values in the course of executing most programs. 
Thus, the expression for E which we gave above 
must be generalized to allow all values of k up to
some maximum. 

The technique used in our experiments for computing $E_p$ is such a
generalization. For some execution trace through a program, at each
time step $i$, some number of processors $k(i)$ will be required. If
the maximum number of processors required on any step is $p$, we
compute the efficiency for any trace as

$$E_p = \frac{\sum_{i=1}^{T_p} k(i)}{p\,T_p},$$

assuming $p$ processors are available. Apparently no previous attempt to


quantify the parameters discussed above has been 
successful for a wide class of programs. Besides 
Kuck, et al [23], the only other published re- 
sults are by Baer and Estrin [4], who report on 
five programs. 
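
The trace-based efficiency can be illustrated with a few lines of code. This is a minimal sketch: the trace below is made up for illustration, and the function name is an assumption. It takes the list of processor counts k(i), uses the trace length as T_p and the widest step as p, and returns the sum of k(i) divided by p T_p.

```python
def trace_efficiency(k):
    """Efficiency of one execution trace.

    k[i] is the number of processors busy at time step i (hypothetical input).
    T_p is the trace length, p the maximum width, and
    E_p = sum(k) / (p * T_p).
    """
    if not k:
        raise ValueError("trace must contain at least one time step")
    t_p = len(k)
    p = max(k)
    return sum(k) / (p * t_p)


# A made-up five-step trace: wide in the middle, narrow at both ends.
example_trace = [1, 4, 8, 4, 1]
print(trace_efficiency(example_trace))  # 18 / (8 * 5) = 0.45
```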

Another commonly held opinion, which has been mentioned by Minsky
[29], is that speedup $S_p$ is proportional to $\log_2 p$. Flynn [14]
further discusses this, assuming that all the operations
simultaneously executed are identical. This may be interpreted to hold
1) over many programs of different characteristics, 2) for one fixed
program with a varying number of processors, or 3) for one program
with varying DO loop limits. That the above is false under
interpretation 1 for our analyses is obvious from Figure 12.
Similarly, it is false under interpretation 2 as the number of
processors is varied between 1 and some number as plotted in Figure
12. As p is increased still further, the speedup and efficiency may be
regarded as constant, or the speedup may be increased at a decreasing
rate together with a decreasing efficiency. Eventually, as p becomes
arbitrarily large, the speedup becomes constant, and in some region
the curve may appear logarithmic. Under interpretation 3, there are
many possibilities: programs with multiply nested DO loops may have
speedups which grow much faster than linearly, and programs without DO
loops of course do not change at all. Rather than discuss the above
any further, we turn to the following.

Abstractly, it seems of more interest to relate speedup to $T_1$ than
to $p$. Based on our data, we offer the:

Observation  For many ordinary FORTRAN programs (with $T_1 < 10,000$),
we can find $p$ such that

1) $T_p = \alpha \log_2 T_1$ for $2 \le \alpha \le 10$,

2) $p \le \dfrac{T_1}{.6 \log_2 T_1}$,

such that

3) $S_p \ge \dfrac{T_1}{10 \log_2 T_1}$ and $E_p \ge .3$.


The average $\alpha$ value in our experiments was about
9. However, the median value was less than 4,
since there were several very large values.

A complete theoretical explanation of this observation would be
difficult, at present. But the following remarks are relevant.
Theoretical speedups of $O(T_1/\log_2 T_1)$ for various classes of
arithmetic expressions have been proved in Brent, et al [6], Maruyama
[28], and Kogge and Stone [19]. Many DO loops yield an array of
expressions to be evaluated simultaneously and this leads to speedups
greater than $O(T_1/\log_2 T_1)$. Other parts of programs use fewer
processors than the maximum and yield lower speedups. However, we have
typically observed speedups of two to eight in programs dominated by
blocks of assignment statements and IF statements, assuming the IF
tree logic of Davis [11].

In practice one is generally given a set of programs to be executed.
If the problem is to design a machine, i.e., choose p, then the above
approach is a reasonable one. Alternatively, the problem may be to
compile them for a given number of processors. If the number available
is less than that determined by the above analysis, the speedup will
be decreased accordingly. If the number to be used is greater than
that determined above, one must face reduced efficiency or
multiprogramming the machine.

We gain several advantages by the analysis of programs in high-level
languages. First, more of a program can be scanned by a compiler than
by lookahead logic in a control unit, so more global information is
available. Second, in FORTRAN, an IF and a DO statement, for example,
are easily distinguishable, but at run time the assembly language
versions of these may be quite difficult to distinguish. Third, a
program can be transformed in major ways at compile time so it may be
run on a particular machine organization. All of these lead to
simpler, faster control units at the expense of more complex
compilation.

Finally, we point out that a number of realities of actual machines
have been glossed over in this paper. We mentioned a number of these
in our section on Goals, Assumptions and Definitions. A more detailed
discussion of the philosophy of our analysis work may be found in
[21,23].

Acknowledgment

We gratefully acknowledge the contributions of C. Cartegini, J.
Claggett, W. Hackmann, D. Romine, W. Tao, and D. Wills, who provided
programming assistance. This research was supported by the National
Science Foundation, Grant No. GJ-36936 and by NASA, Contract No.
NAS2-6724.

[3] J. L. Baer, "A Survey of Some Theoretical Aspects of
Multiprocessing," Computing Surveys, Vol. 5, No. 1 (March 1973), pp.
31-80.

[4] J. L. Baer and G. Estrin, "Bounds for Maximum Parallelism in a
Bilogic Graph Model of Computations," IEEE Transactions on Computers,
Vol. C-18, No. 11 (Nov. 1969), pp. 1012-1014.

[5] H. W. Bingham, E. W. Riegel and D. A. Fisher, "Control Mechanisms
for Parallelism in Programs," Burroughs Corp., Paoli, Pa.,
ECOM-02463-7 (1968).

[6] R. Brent, D. Kuck and K. Maruyama, "The Parallel Evaluation of
Arithmetic Expressions Without Division," IEEE Transactions on
Computers, Vol. C-22, No. 5 (May 1973), pp. 532-534.

[7] P. Budnik and D. J. Kuck, "The Organization and Use of Parallel
Memories," IEEE Transactions on Computers, Vol. C-20, No. 12 (Dec.
1971), pp. 1566-1569.

[8] B. Carnahan, H. A. Luther and J. O. Wilkes, Applied Numerical
Methods, John Wiley and Sons, (1969).

[9] T. C. Chen, "Unconventional Superspeed Computer Systems," AFIPS
Conference Proceedings, Vol. 38 (1971), pp. 365-71.

[10] S. D. Conte, Elementary Numerical Analysis, McGraw-Hill (1965).

[11] E. W. Davis, Jr., A Multiprocessor for Simulation Applications,
Ph.D. thesis, Dept. of Computer Science, University of Ill., Urbana,
Rep. No. 527 (June 1972).

[12] E. W. Davis, Jr., "Concurrent Processing of Conditional Jump
Trees," Compcon 72, IEEE Computer Society Conference Proceedings,
(Sept. 1972), pp. 279-281.

[13] B. De Lugish, A Class of Algorithms for Automatic Evaluation of
Certain Elementary Functions in a Binary Computer, Ph.D. thesis, Dept.
of Computer Science, University of Ill., Urbana, Rep. No. 399 (June
1970).

[14] M. Flynn, "Some Computer Organizations and Their Effectiveness,"
IEEE Transactions on Computers, Vol. C-21, No. 9 (Sept. 1972), pp.
948-960.
Table II. (Row labels: Av. # BAS Outside DO; Av. # BAS Inside DO;
Av. # DO Loops; Av. # Nested DOs; Av. # IFs; Av. # IF Trees;
Av. # Traces; Av. # Statements; Total # Programs.)

Average Measured Values With and Without Memory Fetches

Figure 1. Dependence Graph
Figure 2. Decomposition of DO Loops: (a) Vertical Scheme, (b) Horizontal Scheme
Figure 3. Number of Programs Versus Number of Cards in Program
Figure 4. Number of Programs Versus Fraction DO Loops
Figure 6. Number of Programs Versus Dp
A LANGUAGE FOR CONTROLLING PARALLEL PROCESSES


Bill R. Hays 
Computer Science Department 


Brigham Young University 
Provo, Utah 84602 


Summary 


The design of computers with parallel capabilities, either as
complete processors or multiple units, has raised the question of how
one can take advantage of this increased computational ability. The
paper presents a language designed for control of parallel processes.


There are many formal notations for parallel- 
ism (1,2), but the approach here utilizes formal 
language concepts and reduces the notational com- 
plexity. Parallel computer organization usually 
includes a control state and a similar idea is 
used here. In effect, one has a pushdown store 
automaton (3) controlling or scheduling other 
automatons. Parallel units will be called subac- 
ceptors or acceptors for discussion. The control 
is either local or global with the distinction 
that one global control state can permit a subac- 
ceptor to control another subacceptor (local con- 
trol). The global control state can: (A) communi- 
cate with the subacceptors, (B) sequence subaccep- 


tors, (C) permit local control, and (D) select the 


proper subacceptor and determine if it is avail- 
able. A stack is associated with each subacceptor 
for control and communication. Notationally, if $A_j$ is a parallel
subacceptor, then a symbol associated with its pushdown control state
will be denoted by the subscript $A_j$. The production rules of the
acceptors for parallel control are of the form
$(q_c, i, \phi_t) \rightarrow (q_n, b, \phi_1 \ldots \phi_n)$, with
$q_c$ the current state, $q_n$ the next state, $i$ the expected input
symbol(s), $\phi_t$ the symbol(s) expected on top of the stack, $b$
the output symbol(s), and $\phi_1 \ldots \phi_n$ the output to the
stack ($n > 0$). The rules are executed by simultaneously examining
the current symbol in the input string and the top symbol of the
stack. A production is executed only if both symbols are present.
Successive rules associated with a given state are considered until a
rule is executed or no production remains (this implies one must
specifically provide the production rules for error conditions). After
execution, the scan device is moved to the next symbol and a transfer
is made to the specified state. Not all of the actions have to be
performed, and $\lambda$ indicates the absence of such an action. In
practice, one would
Simply omit it. The rules: (1,049, )>(A; 44) and 


(A, ,A)>(A, 54.) illustrate a transition based on 


reading the stack and no output with the second 
rule representing a transfer with output to the 
stack. Hence, the only required elements are the 
current state and next state. 


Direct control of subacceptors will be accom- 
plished by an associative list. Each element of 
the list corresponds to a subacceptor or state and 
contains control information. For example, if Aj
and Ak are parallel subacceptors, then the list 
would contain 22 My as acceptor equivalents. The 


presence of on. indicates an acceptor is available 
will be 
on. 


a special symbol used only for selecting a subac- 
ceptor and is included in production rules as if 
the associative list were a stack. A read selects 
the symbol from a fixed place in the list and a 
write, by the same production, places the output 
symbol in this sublist. Hence, production rules 
accessing the associative list can write only at 
the entry at which it reads. The distinct types 
of production rules are: 

(A) Standard read-only, erase-only, read- 
write, read-erase and read-erase-write rules. (B) 
Control productions of the form: 1- (a,A,dy.)> 


‘ 
and its absence indicates it is busy. 


(X; 55) which reads the associative control list 


and activates the parallel subacceptor Xj, assign- 
ing it the stack 9;. 2- (AAs dy )>(Xi 2 d7 oq) which 
i 


releases ‘A' for further activation by placing op 
back on the associative control list. 3- (A,A,( 


x5 HI ahxG 0G 09) which reads a request for x. 


to process the stack %} and performs the request 
by activating X; and passing the required informa- 


ron. (C) Local control productions of the form: 


1- (WX 5 aq Py )5(V wy (x4 595 OA) which activates 
‘Y' to process the stack o. and requests a return 
to state X;. 2- (WsX4 saqwy oy >(Y w5, dy, ) if no 
waiting is necessary. 

(D) Subacceptor productions of the form: 1- 
(IX; 285% Jaleo ($5 by) oy.) which request the con- 


trol state 'C' to activate " 


to process the push- 
down store $. 2- (A,X; ,ad¢ 


which also requests a return to state Xa. 


The language could be used to write the pro- 
cedures, but it would be expected that only the 
control procedures would be written in the langu- 
age. 
THE TRANSFORMATION OF FLOW DIAGRAMS INTO MAXIMALLY PARALLEL FORM
G. Urschler 


System Development Division 
IBM Corporation 


Endicott, N.Y. 


Abstract: The algorithmic transformation of flow diagrams into a
goto- and variable-free parallel program representation is described.
It is shown how the control mechanism for these parallel programs
works and that it exhibits dynamically maximum parallelism in a
certain, well-defined sense. The method presented is new and gives the
optimum that can be achieved in intra-task parallelism.


Introduction 
General Introduction 


In an attempt to categorize the types of paral- 
lelism, the following definitions are presented: 


1. Inter-task parallelism. Dependencies between 
concurrently executing work units are allowed. Synchro- 
nization and deadlock prevention techniques are required 


as well as explicit language features for the specification 
of parallelism. 


2.  Intra-task parallelism. No dependencies be- 


tween concurrently executing work units are allowed. 
Required are methods for the automatic detection of 
parallelism. 


3. Parallelism on the hardware level. 


This paper is concerned with intra-task parallelism, and the area of
particular interest is "maximum" parallelism. Although it has been
proven that the parallelization problem is an undecidable one [1], the
results presented in this paper were possible because of a different
understanding of the term "parallelization."


Adding redundancy, for instance, is commonly not regarded as
parallelization. In this paper, however, the detection and
exploitation of an already existing redundancy is also not regarded as
parallelization, but as optimization. Thus the above-mentioned proof
is regarded as a proof for the undecidability of the optimization
problem and thus not conflicting with the contents of this paper.


Scope of the presented parallelization method 


Core language. The method has been developed for 


an input language consisting of read and write statements, 
assignment statements, and branch and decision statements 
(Flow diagrams). Expressions are restricted to either 
simple data variables (a, b, ...) or to simple expressions 


(a+b, f(a,b,c),...). 
Extended language. The method obviously also applies to each language
that is translatable into the core language. Thus it works for a
language additionally containing composite expressions, fixed data
structures, and constant references (A[1], for instance, as opposed to
A[i]).


Extension possibilities. Not described in this paper, 


but known, are extensions of the method to a core lan- 
guage containing subroutines and to the parallelization 
of more than one task. 


Not covered. Not known at the present time are 
extensions to languages involving varying data structures, 
computed references (pointers, subscripted variables with 
subscripts to be evaluated dynamically), and exception 
handling. 


Maximum parallelism 


Based on the above core language, a more precise 
definition of the notion of "maximum parallelism" can be 
given. A statement obviously can be executed as soon as: 


1. all decisions on which this statement's execution
depends have been resolved (this kind of dependency
is called a control dependency),


2. all input values required for the statement's 
execution have been generated (the corresponding depen- 
dency is called an input dependency), and 


3. it is known that these input values have been 
generated (the corresponding dependency is called a data 
dependency); the need for the latter case, being more 
subtle than the previous ones, is illustrated in Figure 1. 


Fig. 1 - Flow Diagram Showing a Data Dependency 
If a resolution of D to the no branch is assumed, then the input
values for C have been produced before execution of D by the preceding
A, but this fact becomes apparent only after resolution of D, and thus
C has to wait for D too.


Sequencing constraints caused by control-, input- 
and data dependencies only, are called necessary ones. 
Maximum parallelism now means that the only logical 
sequencing constraints to be followed at execution time 
are necessary sequencing constraints. 


Benefits of the method 


Most of the conventional parallelization methods [2],[3] parallelize
on a program basis, trying to divide a program into independent
program blocks. Thus the parallelism which can be detected is
inherently limited by the size of the given program. The presented
method, however, parallelizes on a computation (program execution)
basis, thus giving more (potentially infinite) parallelism the
lengthier the computation is. As byproducts, new and highly efficient
program analysis methods as well as a quite unusual parallel program
concept are developed. The latter gives both an insight into the
nature of parallelism on the intra-task level and a certain
understanding of what a machine exploiting this kind of parallelism
might look like.

Paper overview 


The method is illustrated by means of the program in Figure 2 (the
function $y = \sum_{x=1}^{m} \sqrt{x}$ is computed with an error
precision f for the square root calculations). ∇ denotes the program
beginning and Δ the program end. The capital letters are used later
for the symbolic reference of statements and program blocks,
respectively.


In the following, a thorough program analysis is made of this
program, and based upon this analysis the program is translated at
first into a "single assignment" form (in which each variable is
written to, at most, once) and finally into a variable-free form. As
an auxiliary tool, (regular) production systems from the theory of
syntax are used.


Program Analysis 


Control flow analysis 


1. Determination of the program logic. A program like the one shown
normally is, because of the unlimited use of branching, a bowl of
spaghetti. Thus the first step in the analysis is the determination of
the logic (the structure) of the given program (for a more exhaustive
description of this step see Reference [4]).
Fig. 2 - The Source Program to be Parallelized


As an auxiliary tool the notion of "immediate post dominator" is used
[5],[6]. For a statement branching unconditionally, the immediate post
dominator is identical to the successor of this statement. For
decisions, the immediate post dominator is that (uniquely determined)
statement at which all branches evolving from the decision join for
the first time. Thus in the given example, decision P has the
immediate post dominator K and decision Q the immediate post dominator
Z. If a chain of succeeding immediate post dominators is referred to
as a control flow, the control flow of the given program can be
described by means of the following "production":


    ∇ ::= X Y A B C D E F G H I J P K L Q Z Δ


The undefined elements in this production are "modules" P and Q,
which again are described by productions as follows (ε denotes the
empty string; the first alternative describes the "true" branch and
the second one the "false" branch):


    P ::= { ε | D E F G H I J P }_p

    Q ::= { ε | M C D E F G H I J P K L Q }_q
Productions for decisions are derived by constructing the control
flow for each successor of the decision and following it as long as
the "scope" of the decision is not left (which means that the
immediate post dominator of the decision itself is not yet reached).
In particular, it can thus happen, as in the above example, that a
decision alternative becomes empty.
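
The immediate post dominator used above can be computed mechanically. The following is a minimal sketch using the plain iterative set-intersection method, which is a standard technique and not necessarily the one used in References [5],[6]. The small flow graph is hypothetical: it compresses a loop body into one node called "body" and keeps only a decision P with a following statement K.

```python
def immediate_post_dominators(succ, exit_node):
    """Return {node: immediate post dominator} for a flow graph.

    succ maps each node to the list of its successors. Every node is assumed
    to reach exit_node. Uses the plain iterative set-intersection method.
    """
    nodes = set(succ) | {exit_node}
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            common = set.intersection(*(pdom[s] for s in succ[n]))
            new = {n} | common
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    ipdom = {}
    for n in nodes - {exit_node}:
        strict = pdom[n] - {n}
        # Strict post dominators form a chain; the nearest one is the node
        # whose own post-dominator set equals the whole chain.
        ipdom[n] = next(d for d in strict if pdom[d] == strict)
    return ipdom


# Hypothetical graph: B -> body -> P; P either exits to K or repeats body.
flow = {"B": ["body"], "body": ["P"], "P": ["K", "body"], "K": ["end"]}
ipdom = immediate_post_dominators(flow, "end")
print(ipdom["P"])  # K
```

Both branches of the decision P first meet again at K, so K is its immediate post dominator, matching the role K plays for P in the text.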


2. Program reduction. The translation of a flow chart into a
goto-free form has, in essence, been achieved by the copying of
program text (and not, as in the Boehm/Jacopini method [7], by the
introduction of "control switches"). Thus the new program, in general,
becomes much larger than the old one. This inconvenience is removed in
the following by the introduction of abbreviations for lengthy strings
occurring more than once. This reduces the program to the dimensions
of the source program. For the given example this results in:


    ∇ ::= X Y A B S Z Δ
    S ::= C R K L Q
    R ::= D E F G H I J P
    P ::= { ε | R }_p
    Q ::= { ε | M S }_q
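
The effect of the abbreviations can be checked by expanding them again. The sketch below assumes the reduced system reads S ::= C R K L Q and R ::= D E F G H I J P, with the top-level production X Y A B S Z and the "false" alternatives R (for P) and M S (for Q); this reading is inferred from the unreduced productions given earlier. The decision modules P and Q are left unexpanded, and the begin and end markers are omitted.

```python
# Abbreviations introduced by the reduction step (assumed reading).
REDUCED = {
    "S": ["C", "R", "K", "L", "Q"],
    "R": ["D", "E", "F", "G", "H", "I", "J", "P"],
}


def expand(symbols, abbreviations):
    """Replace every abbreviation by its right-hand side, recursively."""
    out = []
    for s in symbols:
        if s in abbreviations:
            out.extend(expand(abbreviations[s], abbreviations))
        else:
            out.append(s)
    return out


top_reduced = "X Y A B S Z".split()
q_false_reduced = "M S".split()
p_false_reduced = ["R"]

print("".join(expand(top_reduced, REDUCED)))      # XYABCDEFGHIJPKLQZ
print("".join(expand(q_false_reduced, REDUCED)))  # MCDEFGHIJPKLQ
print("".join(expand(p_false_reduced, REDUCED)))  # DEFGHIJP
```

The three expansions reproduce the right-hand sides of the unreduced productions, so the abbreviations change only the size of the description.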


Data flow analysis 


1. Derivation of local production systems. From the above "global"
production system, the following set of "local" production systems is
derived, each of which describes the program from the point of view of
a single data variable only. "M^v" has the meaning "Module M as seen
from variable v", "S^R" means "read operation in statement S" and
analogously "S^W" means "write operation in statement S". Productions
for modules not containing a certain variable are omitted. Altogether
this gives the results shown in Table I.


oie" See ie” Re ante P*::={e| R*4 
VW::= SV SY ace RV sy | PY 22> 46] RY} 
Vee AW gn SP eres We ope ge P™::= fe| Ro 
Vises ov SY ::= RY Qw P PY: ;= fe| R"} 
v8::= 98 SP iee BP 9° PEs: fel 28} 
V o 
vi.:= yW of Bl eee ql wt pi = fe} r'} 
VP ses iP SP::= RP @P PP z= fe| Ret 
VY¥::= BW gy ZR Slane Age ne GY | 
Vd::= g4 Stee O80" 
vue y” om ieee Cae hu 
Rass. De ph: Tag Q*::= { € | s*t 
RY::= DW IpR pW1gR ;W IR py Q¥::= { e | sv} 
RM; ;= 2pR pn Q®::= { e€ | wR yWgal 
Ree Gh ee Q¥::= f e | s*} 
° R8::=5 G@™ *y® 7R ps8 Q8::= { e | ss} 
Riz;= *gRpf ° ais:= { e | si} 
RP::= JWpRpP QP sss { ¢ | sP} 
Q’ ss | e | svt 
QI 2:5 { e | 54t 
Q™::= fe | sm 
Table I - Local Production Systems Describing the Flow of Data for the Given Source Program
2. Determination of module interfaces. A local module of the kind M^v
defines (in syntactical terms) a certain language, the sentences of
which are composed of read and write operations only. A module is
called read-like if the corresponding language contains read
operations only. A module is said to require input (denoted by >M^v)
if at least one of the sentences defined by it starts with a read
operation. Analogously it is said that output is required from a
module (denoted by M^v<) if it is not a read-like module and if there
is at least one occurrence of M^v which is followed either by a read
operation or by an input requiring module.
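
The two module properties defined above, read-like and input requiring, can be computed by a small fixed-point iteration. The following is a sketch under stated assumptions: the data layout, the function name, and the example system are all hypothetical (the example is not taken from Table I), and every module is assumed to generate at least one finite sentence.

```python
def analyze(local_system):
    """Classify the modules of one local production system.

    local_system maps a module name to its alternatives. An alternative is a
    list of symbols: "read", "write", or the name of another module.
    Returns (read_like, requires_input) as two sets of module names.
    """
    nullable, has_write, needs_input = set(), set(), set()
    changed = True
    while changed:
        changed = False
        for module, alternatives in local_system.items():
            for alt in alternatives:
                facts = []
                # The alternative can produce the empty sentence.
                if all(s in nullable for s in alt):
                    facts.append(nullable)
                # Some sentence of the alternative contains a write.
                if any(s == "write" or s in has_write for s in alt):
                    facts.append(has_write)
                # Some sentence of the alternative starts with a read.
                for s in alt:
                    if s == "read" or s in needs_input:
                        facts.append(needs_input)
                        break
                    if s not in nullable:
                        break  # a write, or a module that never yields nothing
                for fact in facts:
                    if module not in fact:
                        fact.add(module)
                        changed = True
    read_like = set(local_system) - has_write
    return read_like, needs_input


# Hypothetical local system for a single variable.
example = {
    "S": [["write", "R", "read", "Q"]],
    "R": [["read", "read", "write", "P"]],
    "P": [[], ["R"]],
    "Q": [[], ["read"]],
}
read_like, requires_input = analyze(example)
print(sorted(read_like), sorted(requires_input))  # ['Q'] ['P', 'Q', 'R']
```

In this example P has an empty alternative and an alternative that writes, which is the inconsistent external behavior that the copy statements of the next step are meant to remove.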


3. Incorporation of data dependencies. Local modules from which an
output is required behave almost like a write operation in the sense
that a variable becomes redefined by them. This is, however, not
always true. Whenever there is an alternative which has a read-like
behavior (only read operations are involved), then the external
appearance of this module becomes inconsistent. Sometimes it redefines
the corresponding variable and sometimes it does not. This is exactly
the situation which earlier was referred to as a data dependency. It
is removed by introducing copy statements (like x:=x) in those
alternatives of modules for which output is required (in which,
otherwise, no redefinition would occur). Thus, in the given example,
two copy statements (symbolically denoted by N and O) have to be
introduced in the modules P^x and Q^y, respectively.


4. Introducing logical variables. Whenever one 


and the same physical variable becomes redefined, then
from a logical point of view this is a new variable. This
can be indicated by segmenting each alternative in such 

a way that after each write operation and after each local 
module from which output is required, a new segment 
(being the scope of a new logical variable) begins (denoted 
by ANI S® for instance). In addition, each alternative of 
an input requiring module also starts with a segment. The 
variable names for the segments can be chosen freely with
the restrictions that (a) all variable names within the same
alternative have to be different, (b) different alternatives
of the same module have to begin and end with identically
named variables, and (c) the last variable name in each
alternative should be that of the corresponding physical
variable. Altogether for the given example, the following
set of extended local production systems, incorporating
all data flow analysis information, is obtained (Table II).


V¥i:= 8% S*::= CW] RX || 2xRQx >P& ::={ IWR wy] | RX ; 
Xa Xx xa X Xa x 
Use. GY S’::= RYQY PY ::={e| RV } 
v2:;= AW gn >g9:;=C8 po AZ gn >p" :={ € | | Ry 
n n nN 
Wns io Sveee RYO” i ee a 
v8::= 8 s8::= R&Q8 P& ::={¢ | R8 ft 
v s 
vi: = x" si >s'::= y Rf af >Pf sefe Iilrt 
f 
vP::= sP SP::= RPQP PP ::=fe| RP} 
Wise B™y s¥y Z® >S¥qs= Wt KRKM | QY il 
yay ya yoy 
Vike 6° St::= Ly Qkaa 
q 
ym. = ywi gs ™ >sM..- | 27R gm 
m m 
a RoR ISR: Wipe x 
Rei: | D PS*HS HNUP*| area ae ae Re 
xa X x 
PVA pW | 1fR FW rf IGR pWy4yR pv QY::= { e | sv} 
va Vv v 
PR oe il Sek pe SQ ees te 1 ie al sy 
n na n 
R RW — FW il aqR pw QQ’: = | € | sw} 
Hl 3 
R8::= am “ae IR ps Q g8::= { « | 58} 
>pf...27Rpf ere f 
R Wak P Qfzz= fe 1 sty 
Roeser eee >Qki:= {oR ow | syi} 
P ya y ya sy 
Qt s2= Je | S4 } 
eee reeee 
ams fel sm] 
Table II - Extended Local Production Systems Showing all Data Flow Analysis Information
Program Transformation


Transformation into a "single assignment" form


The information given by the data flow analysis makes it possible to
translate the original global production system into a program form in
which each occurring variable is defined at most once (see References
[8], [9]). To do this, each module is associated with three kinds of
parameters:


1. A decision variable. Based on the value of this variable, a
corresponding alternative is chosen. If the module is an unconditional
one, there is no decision variable.


2. A list of input variables. This is a list of all those variables
that the data flow analysis has shown are required as input to this
module.


3. A list of output variables. This is the list of all those
variables that the data flow analysis has shown are required as output
from this module.


The syntactic notation chosen is illustrated by the following
example:


    Q(q)[n, f, yb, m; y]    (q is the decision variable, input and
                            output variables are separated by a
                            semicolon, and y is an output variable).


The right side of the definition of an unconditional module is an
alternative, being a list of statements separated by semicolons and
enclosed in braces. Conditional modules are described by conditional
expressions, for instance, in the form:


    x = 1 → alternative 1
    x = 2 → alternative 2


(the truth value true is represented by 1, and false by 2). The
variables occurring in each statement are taken from the corresponding
segment of the extended local production systems. Altogether for the
given example the single-assignment form shown in Table III is
obtained (note that the previously introduced symbolic statement names
are indicated above the lines).


In such a program, the "basic statements" (assignment and I/O
statements) and "expansion statements" (module call statements) can be
distinguished. The program can be executed in parallel as follows:
Starting with an instance (i.e. a copy) of the begin statement ∇,
execution of this statement expands into a set of statement instances
of the corresponding alternative. In this set, an instance of a basic
statement becomes executable as soon as all its input variables have a
defined value (because of the single-assignment property there is no
misinterpretation of the definition point possible). An instance of an
expansion statement is executable as soon as its decision variable, if
any, has been defined. Thus instances of unconditional expansion
statements always are executable. Execution of an expansion statement
instance results in an expansion incorporating the corresponding
alternative, whereby passing of parameters by name is assumed and
"internal" variables not occurring in any parameter list are assumed
to be newly created.
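
The firing rule for basic statements can be illustrated with a short simulation. This is a minimal sketch, not the paper's control mechanism: it ignores expansion statements, treats every statement as taking one time step, and uses a statement list that follows one reading of the body of module R in Table III. The function name and data layout are assumptions.

```python
def schedule(statements, defined):
    """Group single-assignment statements into parallel steps.

    statements: list of (target, inputs) pairs, each target assigned once.
    defined: names that already hold a value before the first step.
    A statement fires in the first step at which all its inputs are defined.
    """
    defined = set(defined)
    pending = list(statements)
    steps = []
    while pending:
        ready = [s for s in pending if set(s[1]) <= defined]
        if not ready:
            raise ValueError("some inputs are never defined")
        steps.append([target for target, _ in ready])
        defined.update(target for target, _ in ready)
        pending = [s for s in pending if s not in ready]
    return steps


body_of_R = [
    ("va", ["xa"]),        # D: va := xa * xa
    ("vb", ["va", "n"]),   # E: vb := va - n
    ("w",  ["xa"]),        # F: w := 2 * xa
    ("g",  ["vb", "w"]),   # G: g := vb / w
    ("xb", ["xa", "g"]),   # H: xb := xa - g
    ("v",  ["g"]),         # I: v := |g|
    ("p",  ["v", "f"]),    # J: p := v <= f
]
steps = schedule(body_of_R, {"xa", "n", "f"})
print(steps)
# [['va', 'w'], ['vb'], ['g'], ['xb', 'v'], ['p']]
```

Seven statements complete in five steps because D and F, and later H and I, have no dependencies between them.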


This execution mechanism gives maximum parallelism because the only
sequencing constraints are given by the following facts: 1) a
statement instance has to wait until it has been generated (which
according to the program structure means that it has to wait until all
control dependencies have been resolved), and 2) it has to wait for
its inputs (coming either directly from the "input producer", in which
case an input dependency is resolved, or from a copy statement, in
which case a data dependency is resolved). "Overhead" statements, such
as all unconditional expansion statements different from ∇, are no
obstacle for maximum parallelism, because their instances are
unconditionally executable and can be regarded as part of the
expansion of the preceding conditional expansion statement instance.
For an equivalence proof of the single-assignment program and the
original flowchart, see Reference [10].


       X       Y       A     B                     Z
∇ ::= {read f; read m; n:=1; ya:=0; S[n,f,ya,m;y]; write y}

                   C                   K         L
S[n,f,ya,m;y] ::= {xa:=n; R[xa,n,f;x]; yb:=ya+x; q:=n=m; Q(q)[n,f,yb,m;y]}

                 D        E         F       G        H         I       J
R[xa,n,f;x] ::= {va:=xa²; vb:=va-n; w:=2xa; g:=vb/w; xb:=xa-g; v:=|g|; p:=v≤f; P(p)[xb,n,f;x]}

                           N
P(p)[xa,n,f;x] ::=  p=1 → {x:=xa}
                    p=2 → {R[xa,n,f;x]}

                             O
Q(q)[na,f,ya,m;y] ::= q=1 → {y:=ya}
                             M
                      q=2 → {n:=na+1; S[n,f,ya,m;y]}


Table III - The Original Flow Diagram in "Single-Assignment" Form


The drawback of the single-assignment representa- 
tion is that the control mechanism is not totally explicit. 
Although logically it is clear when a variable gets a de- 
fined value, the signalling of the arrival of this value to 
the involved statement instances is not shown in the con- 
trol mechanism. This drawback is removed in the next 
and last transformation step. 


Transformation into a "variable-free" form 


1. Introduction of "distribution statements". In 
the single-assignment program form each variable is 
defined at most once. There is no limitation on the 
number of readings from one variable, however. By introducing 
distribution statements (being multiple assignments 
distributing a variable value to all places in an alter- 
native where this value is needed) a program form can 
easily be reached, where each variable also is read at 
most once. The idea of this transformation is to store a 
generated value not indirectly to a data base (from where 
it can be retrieved under its name), but directly to all 
places where it is needed (which makes a "local" deter- 
mination of executability possible). 


2. Introduction of "buffer statements". The 
problem with the exploitation of the previous transforma- 
tion is that when a value has to be inserted directly in all 
reference places, then these places have to exist, i.e. 
they have to be allocated. This means that a synchroniza- 
tion between "value definition" and "value place alloca- 
tion" is necessary, which can subvert maximum parallelism. 


The solution to this problem is the introduction of 
"buffer statements" (being copy statements), which are 
inserted between a value generating basic- or expansion- 
statement and the corresponding expansion statement re- 
quiring this value as input. No buffer statements are 
used if the basic statement is a simple one (involves no 
expression evaluation). 


3. Replacing variables by addresses. In the 
following, each module alternative is assumed to be 
arranged linearly, so that each symbol occurring in it has 
an "address" (relative to this alternative). Each alterna- 
tive is later assumed to be preceded by an "address vector", 
being the list of addresses of all module parameters with 
respect to this alternative. (Note that addresses are de- 
noted by an arrow over a variable name (e.g., a⃗); they 
point to the place where the plain variable name occurs 
(e.g., a), which is not to be interpreted as a variable, but 
as a "placeholder".) When a parameter does not occur in 
an alternative, the corresponding address is denoted by 
the "null" address, →. Input parameters occurring in a 
basic statement are called direct input parameters; all 
other input parameters (occurring again in expansion 
statements) are indirect ones (denoted by an underlining 
of the corresponding address in the address vector). 


Within each alternative, each variable that is not 
a global parameter occurs exactly twice. The general rule 
for the replacement of variables by addresses is that the 
place where the "allocation" is done (or, if the 
allocation is done by the surrounding module, the 
place where the definition is done) becomes the address of 
the corresponding mate, and the mate is interpreted as a 
placeholder. Thus if both variable occurrences are in basic 
statements, the definition place becomes the address 
of the reference place; if one variable occurrence is 
within an expansion statement (as a "local" parameter) and 
the other in a basic statement, the parameter becomes 
the address of the other variable occurrence (independent of 
whether the latter is used for reference or for definition). 
Parameter lists as well as the case-distinguishing conditions 
now become redundant. All that is needed is a description 
of the alternatives, which for the given example is given 
in a self-explanatory form by Table IV. (Note that the 
symbolic statement names are indicated above the lines.) 


Program Execution 


A program obtained in this way can be regarded as a parallel 
machine program which is executed as follows: 


1. A copy of the "body" of the begin module v 
is fetched into a "control storage," thereby replacing 
relative addresses by absolute ones. 


2. An instance of a basic statement becomes 
executable if all its definition places are (absolute) 
addresses and all its reference places are values. It is 
executed by evaluating the "right side" expression, 
storing the obtained value to the indicated addresses (in 
case of a null address no storing takes place), and delet- 
ing the executed statement instance in the control storage 
afterwards. 


3. An instance of an expansion statement be- 
comes executable, if its reference place (the previous 
decision variable) — if any — is a value and if all of its 
parameter places are (absolute) addresses. It is ex- 
ecuted by fetching a copy of the body of the correspond- 
ing alternative into the control storage (if there is enough 
space), thereby replacing absolute addresses by relative 
ones and performing the following “parameter passing": 
The address of a direct input parameter is written to the 
address given by the corresponding "actual" (contained 
in the invoking statement instance) parameter. Address- 
es found in actual output or actual indirect input param- 
eter places are, however, written to the address of the 
corresponding newly allocated "formal" parameter (thus 
in this case the passing of parameters has the "conven- 
tional" direction, whereas in the former case the passing 
      X       Y       A     B      SV             Z
v::= {read f; read m; n:=1; ya:=0; S[n,f,ya,m;y]; write y}


                   SN              SF        SM        C       RB       RC
S::= [n,f,ya,m;y] {na,nb,nc,nd:=n; fa,fb:=f; ma,mb:=m; xa:=na; naa:=nb; faa:=fa;
                   RS               K         L
                   R[xa,naa,faa;x]; yb:=ya+x; q:=nc=ma;
                   QA       QB       QC       QD
                   nda:=nd; fba:=fb; yba:=yb; mba:=mb;
                   Q
                   Q(q)[nda,fba,yba,mba;y]}


                 RX               RN        RF
R::= [xa,n,f;x] {xaa,xab,xac:=xa; na,nb:=n; fa,fb:=f;
                 D         E          F                 RG
                 va:=xaa²; vb:=va-na; w:=2xab; g:=vb/w; ga,gb:=g;
                                      J        PA       PB       PC
                 xb:=xac-ga; v:=|gb|; p:=v<fa; xba:=xb; nba:=nb; fba:=fb;
                 P(p)[xba,nba,fba;x]}


                     M        SA     SQ
Q2::= [na,f,ya,m;y] {n:=na+1; na:=n; S[na,f,ya,m;y]}


Table IV - The Original Flow Diagram in "Variable-free" Representation 


direction is reversed). 


This execution mechanism is illustrated in Table V 
by the beginning of a possible execution for the given program and 
the assumed input values f = 0.1 and m = 2. 


All execution possibilities for the previous pro- 
gram inputs are described symbolically by the "precedence 
graph" shown in Figure 3. 


Analysis of the graph shows that the computation 
of different square roots can be done in parallel (thus the 
"outer" loop in the original flow diagram is a parallel 
one), whereas the iterations required to compute the 
same square root have to be done serially. 


Summary 


The paper has shown how flow diagrams of a 
certain restricted standard form can automatically be 
transformed (at compile time) into a set of goto-free and 
variable-free modules, constituting a highly paral- 
lel program structure. The intelligence incorporated into 
the resulting programs not only allows the exploitation of 
maximum parallelism, but also provides for dynamic 
storage allocation, dynamic relocation and direct data 
processing (as opposed to indirect data processing implied 
by the use of variables). 


The techniques used can, if properly understood, 
be very fruitful for the further development of many dif- 
ferent areas including (optimizing) compilers, operating 
systems (paging techniques), and new (highly parallel) 
machine concepts. 


References 


[1] A. J. Bernstein, "Analysis of Programs for Paral- 
lel Processing," IEEE Trans. of Electr. Comp. 
(Oct. 1966), pp. 757-763. 

[2] H. W. Bingham et al., Automatic Detection of 
Parallelism in Computer Programs, Burroughs 
Corp., Paoli, Pa., Technical Report (Nov. 1967). 

[3] M. R. Shapiro et al., The Representation of Al- 
gorithms, Applied Data Research Inc., New 
York, N.Y., Technical Report (Sept. 1969). 

[4] G. Urschler, "Automated Functional Program- 
ming," paper available from author. 

[5] R. T. Prosser, "Application of Boolean Matrices 
to the Analysis of Flow Diagrams," 1959 Proc. of 
the EJCC, pp. 133-138. 

[6] E. S. Lowry and C. W. Medlock, "Object Code 
Optimization," CACM, Vol. 12 (Jan. 1969). 

[7] C. Boehm and G. Jacopini, "Flow Diagrams, 
Turing Machines and Languages with only two 
Formation Rules," CACM, Vol. 9 (1966), 
pp. 366-371. 

[8] L. G. Tesler and H. J. Enea, "A Language 
Design for Concurrent Processes," Proc. of the 
SJCC 1968, pp. 403-408. 

[9] D. D. Chamberlin, "The 'Single Assignment' 
Approach to Parallel Processing," Proc. of the 
FJCC 1971, pp. 263-269. 

[10] G. Urschler, The Inherent Parallelism of Flow 
Diagrams, IBM Lab, Vienna, Technical Report 
25.129 (July 1972), p. 40. 


[Table V: columns "Simultaneously performed statement instances" and "Result"; the table body is not recoverable.] 


Table V - Execution Begin of a Parallel, Variable-free Program 
Fig. 3 - Precedence Graph, Describing Execution Possibilities 
FORMAL TRANSFORMATIONS FOR PARALLEL PROCESSING LOGIC 


Department of Electrical and Computer Engineering 
Syracuse University 
Syracuse, N.Y. 


Abstract -- Formal transformations are de- 
scribed which convert sequential processes into 
parallel processes preserving the logical behavior. 
The formal transforms are carried out on a logic 
design language. The results of the transfor- 
mations are alternative designs expressed in the 
same language. The technique is applied to sever- 
al sample design problems. 


Introduction 


Languages for describing the structure of 
computers and other digital systems are receiving 
increased attention. The motivation for such 
activity is a hope that higher order languages 
for hardware structures will provide the same 
sorts of benefits in hardware design as program- 
ming languages provide for software design. In 
particular, a satisfactory system description 
language should provide a means for coping with 
the complexity found in typical logic systems. 


The structural complexity of parallel 
processing systems is greater than that of serial 
processors. As a result the need for control of 
complexity is increased and the task is more 
difficult. 


In this article the application of system 
description languages to parallel processor 
systems is examined. The questions of interest 
are: 

1. Can a system description language provide 
adequately compact and precise descriptions of 
parallel processing logic networks? 


2. What are the transformations which can 
be carried out on system descriptions which will 
affect the speed of operation (degree of parallel- 
ism) while preserving the essential logical 
behavior? 


3. Can useful and economical designs for 
parallel processes be obtained utilizing formal 
transformations? 


4. What is the relationship between such 
formal transformations on the logic and related 
transformations on programs? 


The basic ideas behind the transformations 
required to increase parallelism of combinatorial 
and sequential logic designs are well-known [1]. 
However the implementation of these algorithms 
will depend critically on the representation 
system used for the design. Various methods of 
representing designs must be studied to determine 
the simplicity and efficiency of the operations 
to be carried out on the designs. 


The language should provide an adequate data 
interface with other design automation programs 
for testing, layout, wiring, and simulation. The 
language described here is APL-based, so it has 
the advantage of having a set of vector and array 
type operators. 


The paper describes the principal features 
of a system design language and some design 
transformations within the language. Several 
useful logic design examples are considered. 
Designs with a high degree of parallel 
activity are studied to determine whether these 
designs could be generated by a straightforward 
application of automatic design transformations. 


The transformations are essentially logical 
in nature. A machine description is converted 
to a logically equivalent machine description 
where the derived machine exhibits more parallel- 
ism than the original. 


The introduction of parallelism generally 
substitutes a spatial iteration of signals for 
the original time iteration. Hence the two ma- 
chines are not equivalent in the sense usually 
used with respect to sequential machines. They 
are equivalent in the sense that there is a map- 
ping of (output signal, time) of the original 
machine to (output signal, time) in the new 
machine which preserves the logical behavior of 
the machine. 


Register Transfer Language 


Many different notations have been suggested 
for describing systems. They can be divided into 
two broad classes, those which describe behavior 
and those which describe structure. The former 
type of description is particularly useful for 
simulation while the latter is useful in design 
automation systems. 


The language used here utilizes many APL 
features and is a register transfer language in- 
tended to describe the structure of a digital 
system. Since the descriptions tend to look like 
programs it is important to remember the differ- 
ences between descriptions and programs. 

1. A system description describes a 
structure and not a process; 


2. The order in which the statements occur 
in a system description has no significance. 


The designer using a system description is usu- 
ally thinking in terms of the behavior of the 
system rather than its structure. The description 
can be viewed as a specification of behavior or 
of a process. In what follows, the description 
will be considered to specify a structure and the 
transformations will be designed to derive alterna- 
tive structures with equivalent behavior. 


Kernel Language 


The kernel language is the most primitive 
form of the language which is adequate to describe 
any system describable by the complete language. 
The strategy used here is to define a very simple 
kernel language whose properties are simple. Then 
complex linguistic facilities are added to the 
language and these facilities are defined by a 
translation process which eliminates the oc- 
currence of a complex feature and yields an 
equivalent description in the kernel language. If 
this technique is used the complete sophisticated 
language can describe no more than the kernel 
language. However the complete language will 
generally allow vastly more compact system 
descriptions with no loss of precision. The same 
technique has been suggested for defining program- 
ming languages [2]. 


There are only five types of statement in the 
kernel language: 

1. The conditional register transfer, 
A $ B ← C, having the form <name> $ <name> ← <name> 

2. The synonym statement, A = B, having the 
form <name> = <name> 

3. The AND statement, A = AND (B, C, D), 
having the form <name> = AND (<name list>) 

4. The OR statement, A = OR (B, C, D), 
having the form <name> = OR (<name list>) 

5. The NOT statement, A = NOT (B), having 
the form <name> = NOT (<name>) 


With a sufficient number of statements in 
the kernel language any network of logic involving 
registers and logic gates can be represented. 
The form of the conditional transfer implies 
synchronous logic and the exact logic associated 
with the register input is unspecified. The 
kernel language cannot describe asynchronous 
objects such as delay lines and one-shots without 
the addition of new kernel statements. 
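

To make the five statement types concrete, the following Python sketch simulates one clock beat of a kernel-language description. It is only an illustration under stated assumptions: the tuple encodings of gates and transfers, the operator tag "SYN" for the synonym statement, and the function name `step` are hypothetical and are not part of the language described here.

```python
def step(registers, inputs, gates, transfers):
    """Simulate one clock beat of a kernel-language description.

    gates:     list of (output, op, operands) with op in AND, OR, NOT, SYN
    transfers: list of (condition, register, source), read as
               condition $ register <- source
    """
    signals = {**registers, **inputs, "1": 1}  # "1" is the constant condition
    # Statement order has no significance, so keep evaluating gates
    # until every gate output has been determined.
    remaining = list(gates)
    while remaining:
        progress = False
        for gate in list(remaining):
            out, op, operands = gate
            if all(name in signals for name in operands):
                vals = [signals[name] for name in operands]
                if op == "AND":
                    signals[out] = int(all(vals))
                elif op == "OR":
                    signals[out] = int(any(vals))
                elif op == "NOT":
                    signals[out] = 1 - vals[0]
                else:  # SYN: plain synonym
                    signals[out] = vals[0]
                remaining.remove(gate)
                progress = True
        if not progress:
            raise ValueError("undefined name or loop in the combinatorial logic")
    # Synchronous logic: all registers are loaded from the old signal values.
    new_registers = dict(registers)
    for condition, register, source in transfers:
        if signals[condition]:
            new_registers[register] = signals[source]
    return new_registers


# A one-bit toggle: X = NOT (A) and 1 $ A <- X
gates = [("X", "NOT", ["A"])]
transfers = [("1", "A", "X")]
state = {"A": 0}
state = step(state, {}, gates, transfers)
print(state)
```

The gate list is deliberately evaluated without regard to its order, reflecting the remark that the order of statements in a system description has no significance.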


All the sophisticated linguistic facilities 
which are added from this point on are defined 
by means of a translation process which eliminates 
complex structure and derives an equivalent set 
of kernel statements. 


The kernel language is extended by 

1. extension of naming to allow naming of 
vectors and arrays of higher dimensions, 

2. extension of operations to apply to 
vectors and arrays, 

3. addition of a programming language to 
allow computation to generate primitive language 
statements, 

4. facilities for defining macro system 
descriptions with formal parameters, 

5. facilities for declaring types such as 
register, arithmetic variables, etc., and 

6. the addition of a set of functions whose 
values are related to the system description 
parameters. 


With these extensions precise and compact 
descriptions of digital systems can be developed. 
The complex networks associated with MSI and LSI 
usually exhibit sufficient repetitive structure 
so that the facilities of the language can be 
used to good effect. 


APL conventions are used to extend the range 
of operators to vectors and arrays. The macro 
facility corresponds to function definition 
within a programming language. The mention of a 
macro name with actual parameters specified calls 
for the addition of the text which is the body 
of the macro, with formal parameters replaced 
by actual parameters. A conventional programming 
language can be used to control the generation 
of text and the computation of literal subscripts. 
It is important to realize that the programming 
language portion of the system description is 
not used to define a program but to generate a 
body of text. 


The addition of the sophisticated linguistic 
facilities does not extend the range of systems 
which can be described. Each of the added 
linguistic types can be translated into an 
equivalent set of primitive statements. The 
technique has been proposed to simplify the 
concepts underlying conventional programming 
languages. The advantage is that the range of 
meaning of a description is not changed by the 
sophisticated techniques of description. The 
description still corresponds to the specification 
of a network of gates and registers. 


However, repeated use of macro system 
descriptions permits the design objects to 
correspond to more and more complex networks. 
The system description language provides desira- 
ble simplification of the description as long as 
there is some regularity and iterative structure 
in the network. 


Register Transfers 


The basic algorithm to be used for speeding 
up sequential logic has the effect of doubling 
the computational rate. The derived machine 
does in one clock cycle what the original machine 
does in two cycles. The equivalence relation 
between the two machines relates pairs of inputs 
and outputs which occur in time sequence to 
pairs which occur in spatial sequence. The 
algorithm generates a set of combinatorial logic 
equations virtually identical to the original 
register transfer equations. The combinatorial 
logic equations generate the intermediate values 
of the register variables, so a double time step 
is performed on each clock beat. 


The following single statement is a de- 
scription of a binary counter. 

1 $ A ← X 
NET (A:X) 

∇ NET (B:C)        Definition of NET 
C = B ≠ D 
NET2 (B:D) 
∇ 

∇ NET2 (E:F)       Definition of NET2 
I ← 0 
F[I] = 1 
G: F[I + 1] = F[I] ∧ E[I] 
→ ((I ← I + 1) < ρE)/G 
∇ 

The definition of NET2 specifies a network 
with input E and output F as formal parameters. 
The value of F is 1 in all positions corresponding 
to consecutive 1's in E and in the position of 
the first 0. The diagram is shown in Figure la. 
The network defined by NET includes an occurrence 
of NET2 and has a bank of exclusive-or gates in 
addition. Hence, the total diagram is as shown 
in Figure 1b. 
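

The behavior of the two networks can be checked with a short simulation. The Python sketch below is an illustration, not the design language itself; it assumes that bit vectors are stored least significant bit first, that the final carry out of the chain is discarded, and it uses the hypothetical helper `to_int` for display.

```python
def net2(e):
    """Carry chain of NET2: f[0] = 1 and f[i+1] = f[i] AND e[i].

    f[i] is 1 exactly when all lower-order bits of e are 1.
    """
    f = [1]
    for i in range(len(e) - 1):
        f.append(f[i] & e[i])
    return f


def net(b):
    """NET: a bank of exclusive-or gates driven by the carry chain."""
    d = net2(b)
    return [bit ^ carry for bit, carry in zip(b, d)]


def to_int(bits):
    """Interpret a bit vector, least significant bit first, as an integer."""
    return sum(bit << position for position, bit in enumerate(bits))


a = [0, 0, 0]
for _ in range(5):
    a = net(a)       # one clock beat of  1 $ A <- X  with  NET (A:X)
print(to_int(a))
```

Each call of `net` corresponds to one clock beat of the register transfer, so the register content advances by one per beat and wraps around at the register width.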


To describe the system which will count two 
for each unit of time it is only necessary to 
duplicate the network 

1 $ A ← Y 

NET (A:X) 
NET (X:Y) 


yielding a net of the form shown in 
Figure 2. 


Our example is a very simple one but the 
basic idea is the same in what follows. The next 
step in our simple example is to take advantage 
of the array naming features of the system 
description language. Consider the extension to 
an array of logic which causes the counter to count 
by N in each unit of time. A network generating 
macro called UNET can be used to replicate the 
network to form an array. 

1 $ A ← W[;N] 

UNET (A: N: NET: W) 


∇ UNET (a: n: net: w) 
i ← 0 
w[;0] = a 
C: net (w[;i]: w[;i + 1]) 
→ (n > i ← i + 1)/C 
∇ 


The macro UNET when mentioned generates the 
system description of a network of N binary counter 
networks connected end to end so that a count up 
by N occurs. Refer to Figure 3. 
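

The effect of the replication can be sketched in Python. This is an illustrative model only: the column array w is represented as a list of bit vectors (least significant bit first), and the function names `net`, `unet`, `clock` and `to_int` are assumptions made for the example.

```python
def net(b):
    """One counter network: increments the bit vector b by one (LSB first)."""
    carry, out = 1, []
    for bit in b:
        out.append(bit ^ carry)
        carry &= bit
    return out


def unet(a, n, net):
    """Spatial replication: w[0] = a and w[i+1] = net(w[i]) for i < n."""
    w = [list(a)]
    for i in range(n):
        w.append(net(w[i]))
    return w


def clock(a, n):
    """One clock beat of the transfer  1 $ A <- W[;N]."""
    return unet(a, n, net)[n]


def to_int(bits):
    return sum(bit << position for position, bit in enumerate(bits))


a = [0, 0, 0, 0]
a = clock(a, 3)
a = clock(a, 3)
print(to_int(a))
```

The time iteration of the single counter has been replaced by a spatial iteration: the n copies of the network are traversed combinatorially within one clock beat.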


The example is a simple one involving only 
one register, no conditional transfers and no 
inputs. The transformation method can be ex- 
tended to cover the more general case. A de- 
scription of a serial adder is the two statements 
shown below. It is a slow serial adder since 
the shifting and the addition are not overlapped 
in time. 


t $ Ses ee (A, B, C); C + MAJ(A, B, C); 
t+-t 
£2 SO Soe ee 
A and B are assumed to be input strings 
representing the numbers to be added. 


The conditionals can be brought over to the 
right hand side to obtain 


1$ S<«+X;C<Y3; t+ ty 
X= (ta (A+tBt+C), 14S) VC t) ags 
Y = (t A MAJ(A,B,C) )Vv~taAc 
c = ~t 


Networks with formal parameters can be defined 


NET4 (a: b: c: s: t: x) 


NET5 (a: b: c: s: t: y) 
where the defining equations are essentially as 
above. 


A and B are external input sequences for 
which subscripts can be used to designate 
successive inputs to an array. The macro 
definition of a network for doing n steps of the 
original machine is 


∇ UNET2 (a: b: c: s: t: x: y: n) 


c1[0] = c 
i ← 0 
C: NET4 (a[i]: b[i]: c1[i]: s1[;i]: t[i]: 
x1[;i]) 
NET5 (a[i]: b[i]: c1[i]: s1[;i]: t[i]: 
y1[i]) 


c1[i + 1] = y1[i] 
s1[;i + 1] = x1[;i] 
t[i + 1] = t[i] 
→ (n > i ← i + 1)/C 


y = y1[i] 
x = x1[;i] 
∇ 


A single mention of UNET2 with n = 2 will 
result in a logic network which overlaps in time 
the shifting and adding. It can be seen that 
actually two independent systems are formed. The 
first does a shift and add simultaneously while 
the other does an add and shift simultaneously. 
The network is a spatial sequence of combinatorial 
networks. Only one of the two networks will be 
active depending on the initial value of t. 


This phenomenon of generating multiple 
systems is a general one in the transformation. 
The transformation generates n machines which 
differ from one another in phase. The initial 
conditions will normally cause a selection of 
one of the machines. For example, in the case 
of the serial adder an initial condition of t = 1 
selects the machine which adds and then shifts 
in the spatial iteration. 


If the transformation is applied again with n = N 
an N bit parallel adder array is produced. 


Combinatorial Logic 


The combinatorial portion of a system de- 
scription corresponds to a set of boolean 
equations. The equations describe a multiple 
input, multiple output network. Usually the de- 
signer will have utilized the facilities of the 
language to describe the iterative portions of 
the combinatorial logic but the description can 
be translated to a large set of primitive 
equations when necessary. 


For networks of this type the maximum depth 
is defined as the maximum number of gates which 
must be passed through in going from an input 
through the network to an output. The trans- 
formations described have the effect of in- 
creasing this depth so that very large deep 
combinatorial nets can be generated for array 
type logic. The delay through the network is 
proportional to its depth and the delay can 
become a decisive factor in the overall speed 
of the system. 


The algebraic identities needed to reduce 
the depth of a network of gates are well-known 
and various strategies can be utilized in trans- 
forming a network. Depth reduction transformations 
are shown in Figure 5. DeMorgan's 
theorem is used to push the inverters through the 
AND and OR gates in order to produce subnetworks on 
which the processes of parenthesis removal or 
multiplying out can be performed. In practice a 
number of constraints must be observed. 
The transformations must not eliminate output 
wires, and gates whose outputs drive more than one 
gate must be transformed with care. In addition 
there are normally fan-in and fan-out limits which 
will eventually be exceeded. Fan-out limits do 
not affect the achievable speed since network 
duplication can be used to provide the necessary 
number of outputs. 


The structured nature of the system de- 
scriptions permits a kind of controlled reduction 
of the delay in the combinatorial portion of the 
network which is different from the technique of 
reducing to primitive statements and applying 
boolean algebra transformations. We expect that 
the large delay values will be generated by 
network forms such as shown in Figure 6. A 
combinatorial net is replicated and interconnected 
in such a fashion that the delay is proportional 
to the degree of replication. In the system 
description this would appear as a definition in 
which the output and input wires are connected 
according to some recursion formula. 


To reduce the total delay there are two 
main choices: reduce the value of A, the delay 
per network element, or form a new network 
element which need be replicated by a lower factor 
without increasing A by the same factor, as shown 
in Figure 7. The replicated network is assumed 
to be of arbitrary complexity. 


If the system description for the combi- 
natorial network is simply translated by macro 
substitution to form a large set of primitive 
gate statements, the iterative structure of the 
network is lost, or at least hidden. As a result 
the task of transforming the set of equations to 
an alternative set which has smaller depth cannot 
easily take advantage of the iterative structure. 
It is desirable to separate the two methods of 
reducing overall depth. In the first case an 
attempt is made to reduce the value of A, the 
delay associated with one element of the repli- 
cated network. This requires reduction of that 
element to primitive gate statements and the 
application of the boolean algebra transformations 
to reduce the delay. Having done the calculation 
once, the result can be used to realize the 
replicated elements. 


The second case requires definition of a 
larger more complex network element so that the 
replication factor is reduced. Then the larger 
defined element is processed to reduce the depth 
and to reduce the total delay. The definition 
of the more complex network element can be 
obtained in a straightforward way from the 
definitions of the original network. 


Assume that the replication factor is to be 
divided by 2 by combining the functions performed 
by two elements. If the interconnection is a 
simple linear one then the process proceeds as 
shown in Figure 8. 


Assume NET(A: B: C: D) is defined and is a 
replicated element. Then if I, Z are n element 
vectors the linear interconnection can be 
defined by 


∇ LINET(I:Z) 


NET(I: E: Z: F) 
E=l19qF 


A double element equivalent to two linearly 
interconnected NET elements 
would be NET2(A: B: C: D), defined 
by 

∇ NET2(A: B: C: D) 


NET(A[0]: B: C[0]: X) 
NET(A[1]: X: C[1]: D) 


∇ 
The linear interconnection of four elements 
is shown in Figure 8b. 
The maximum delay is unaffected by the change. 


The advantage of the new form is that NET2 can 
now have its delay reduced using the boolean 
transformations of the primitive statements 
corresponding to NET2. The complete network 
can be described by an n/2 replication of NET2. 
For the more general case of combining N networks 
into a single element, NETN may be defined, which 
consists of N of the original elements with the 
internal connections defined by the recursion 
formula of the original network. 


The main steps in forming NETN are given 
below: 


1. Replicate NET 
NETN = N ρ NET 


2. Form internal connections 
from the definition of cascade structure 
we have: 


D[I] = B[f(I)] 


where typically f(I) = I + CONST 

Hence, an internal connection is specif- 
ied if 

(f(I) ÷ N) = 0 for I < N 


otherwise an external connection is 
needed. 


3. Form a linear structure of NETN elements 
LS = K ρ NETN 


4. Form the connections between the NETN 
elements 


D[I] = B[f(I)] 
D[L; M] = B[f(M×N+L) | N] [f(M×N+L) ÷ N] 


which includes the previously defined in- 
ternal connections. 


Comparisons 


The processes described here are intended 
for use in a software system to aid the logic 
designer. The techniques shown are quite differ- 
ent from those which are under study for parallel 
programming and for parallel organization of 
computing systems. In parallel programming 
studies a basic control mechanism is assumed and 
parallelism consists of allowing two or more 
controllers to proceed more or less independently. 
The logical and arithmetic processes being per- 
formed are considered only to the extent that 
they influence the flow of control. In register 
transfer system descriptions no real distinction 
is made between control and processing activity 
although such distinctions may play an important 
role in the thinking of the designer. 


When a logical transformation is performed 
on the system to speed it up the logical networks 
are replicated if there is no possibility for 
concurrent operation. However the network is 
not replicated if concurrent operation is possi- 
ble. This is illustrated in the example of the 
serial adder in the paper. The first speedup 
caused a time overlapping of shift and add with 
essentially no increase in hardware. Further 
transformations to create parallel addition 
required replication of the basic adding network. 
Figure 3. N Step Counter 


Figure 4. Add and Shift Spatial Sequence 


Figure 5. Depth Reduction for Combinatorial Nets (5b. DeMorgan's Theorem) 


Figure 6. Network Form with Large Delay 


Figure 7. Network Form with Reduced Delay 


Figure 8. Reduction of Replication Factor (8a. Elementary Network Unit; 8b. Linear Network of 4 Units; Network of Complex Units) 
A STRUCTURED APPROACH TO CONCURRENT PROCESS SYNCHRONISATION 


Santosh K. Shrivastava 


Computer Laboratory, University of Cambridge, England 


Summary 


This paper briefly describes a concurrent process 
synchronisation method to be used with secretaries 
[1] or monitors [2]; an operating system structur- 
ing concept developed by Dijkstra and Hoare. 

The well known example of readers and writers 
[4] is used below to illustrate the method and 
the monitor concept. 
file:monitor; 
begin rr, aw: shared integer; free: shared boolean; 
nowriter: condition (aw = 0); 
noreader: condition (rr = o & free); 
procedure startread; 
begin await nowriter; 
procedure endread; 
begin switch:boolean; switch:=false; 
with rr do begin rr:=rr-1; 
if rr=o then switch:=true; end 
if switch then test noreader; 
end endread; 
procedure startwrite; 
begin with aw do aw:=awtl; await noreader; 
with free do free:=false; 
end startwrite; 
procedure endwrite; 
begin switch:boolean; switch:=false; 
with free, aw do begin free:=true; 
aw:=aw-1; if aw=o then switch:=true; end 
if switch then testall nowriter else test noreader 
end endwrite; | 
with aw,rr, free do begin aw,rr:=o; free:=true;end 
note give initial values; 
end file; 

A file is to be used for reading or writing. 
Any number of 'readers' may read simultaneously, 
but 'writer' must have exclusive use; further, 
writers are given priority. 

Calls on a monitor procedure are of the form: 
monitor name,procedure name (--parameters-—-) ; 
Thus, the readers will use the code: file,start- 
read; ‘read operation'; file.endread; to use the 
file. A monitor is treated as a critical region 
so that processes have exclusive use of it. A 
"condition variable' represents some condition 
for the resource use, expressed as a boolean ex- 
pression involving the monitor variables. With 
each condition variable we also associate at com- 
pile time, (a) two boolean variables 'state' and 
"current', when 'current' is true, the value of 
"state' is taken to represent the value of the 
expression, when 'current' is false, this is not 
so, and (b) a queue for waiting processes. The 
operation ‘await condition name' is defined as 
follows: if ‘current' and 'state' of that condition 
variable are true, the executing process continues; 
if 'current' is true and 'state' is false the 
proces§ releases the monitor exclusion and waits 


with rr do rr:=rrtl; end 


(a) 


On leave from the Plessey Co. Ltd. 
Poole, Dorset, England. 
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on that condition's queue; if 'current' is false, 
the process evaluates the expression and sets 
'state' accordingly, 'current' is made true, and the 
process now continues or waits as described above. 
The operation 'test condition name' is defined as 
follows: if 'current' and 'state' of that condition 
variable are true, the executing process removes 
one waiting process (if any) from the condition's 
queue and puts it on the queue of processes trying 
to enter the monitor; if 'current' is false, the 
process evaluates the expression, sets 'state' 
accordingly, and makes 'current' true; if 'state' 
is now true, a waiting process is scheduled as 
described above. A 'testall condition name' 
operation is similar, except that instead of one, 
all the waiting processes are scheduled. A 
resumed process, when given entry to the monitor, 
reexamines the 'await' condition as described. 

The monitor variables that occur in the condition 
expressions are declared shared; operations on a 
shared variable are permitted only through the 
notation 'with shared variable name do S.' This 
operation is defined as follows: all the 'current' 
bits of the condition variables whose condition 
expressions refer to that shared variable 
are made false, then S is executed. No 'test' 
or 'await' is permitted inside S. 

It is now easily seen that in this synchronisation 
method, evaluation of condition expressions 
is kept to a minimum. Thus, when 
readers are reading, the first writer to find 
this will make 'state' of the 'noreader' condition 
false and 'current' true. Any other writers 
entering the monitor consecutively now do not 
evaluate 'noreader' to find out that they must 
wait. As conditions can be arbitrarily complex, 
this method becomes particularly attractive 
when resources are heavily utilized. 
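
The caching rule described above can be illustrated with a small single-threaded sketch in Python. It is not the author's implementation: the class name `CachedCondition`, the helper `with_shared`, the `evaluations` counter and the dictionary standing in for the monitor variables are all assumptions made for illustration, and monitor exclusion and real process scheduling are left out. The sketch only shows that several writers arriving while readers are active cause the 'noreader' expression to be evaluated once.

```python
class CachedCondition:
    """Condition variable carrying the 'state'/'current' pair described in the text."""

    def __init__(self, predicate):
        self.predicate = predicate   # boolean expression over the monitor variables
        self.state = False           # last evaluated value of the expression
        self.current = False         # True while 'state' is known to be up to date
        self.queue = []              # waiting processes, in arrival order
        self.evaluations = 0         # instrumentation only (hypothetical, not in the paper)

    def _refresh(self):
        # Evaluate the expression only when the cached value is stale.
        if not self.current:
            self.state = bool(self.predicate())
            self.current = True
            self.evaluations += 1
        return self.state

    def await_(self, process):
        """Return True if the process may continue, otherwise enqueue it."""
        if self._refresh():
            return True
        self.queue.append(process)
        return False

    def test(self):
        """Release at most one waiting process if the condition holds."""
        if self._refresh() and self.queue:
            return [self.queue.pop(0)]
        return []

    def testall(self):
        """Release every waiting process if the condition holds."""
        if self._refresh():
            released, self.queue = self.queue, []
            return released
        return []


def with_shared(conditions, action):
    """'with v do S': mark the conditions that mention v as stale, then run S."""
    for cond in conditions:
        cond.current = False
    action()


# Demo: two readers are active, so 'noreader' is false for every arriving writer.
monitor = {"rr": 2, "aw": 0, "free": True}
noreader = CachedCondition(lambda: monitor["rr"] == 0 and monitor["free"])


def add_writer():
    monitor["aw"] += 1


def end_reader():
    monitor["rr"] -= 1


blocked = []
for writer in ("w1", "w2", "w3"):
    with_shared([], add_writer)          # 'aw' does not occur in 'noreader'
    if not noreader.await_(writer):
        blocked.append(writer)

evaluations_while_reading = noreader.evaluations

with_shared([noreader], end_reader)      # 'rr' occurs in 'noreader': cache goes stale
with_shared([noreader], end_reader)
released = noreader.test()               # one re-evaluation, one writer released
```

In this run the three writers trigger a single evaluation of the expression; the second evaluation happens only after `rr` has been changed through `with_shared`.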

The method can be incorporated in high-level 
software writing languages with the monitor 
concept. A detailed evaluation of various syn- 
chronisation techniques, including the existing 
proposals [2,3] and parallel programming tech- 
niques using monitors will appear in the author's 
Ph.D. thesis. 
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Abstract -- Two methods for employ- 
ing parallelism in tape-sorting are pre- 
sented. Method A is the natural way to 
use parallelism. Method B is new. Both 
approximately achieve the goal of reduc- 
ing the processing time by a divisor 
which is the number of processors. 

I. Introduction 

It is reasonable to assume that one 
is willing to use P processors instead 
of one if the computation time is cut 
down by the same factor. In certain applications 
this has been shown to be impossible. 

Fortunately, this kind of saving in 
time is possible in the case of external 
sorting. Two methods for achieving this 
goal are described. The first one is 
natural and uses known techniques. The 
second method uses new ideas and is be- 
lieved to be more elegant and easier to 
program. 


The description is in terms of tapes, 
but any linear mass storage can be used 
instead. 
II. Method A 

Assume we have N records, P processors 
and 4P tapes. Also assume that 
initially the N records are all stored 
on one tape. The sorting is achieved 
through the following steps: 


(1) The N records are distributed to 
2P tapes in such a way that each of them 
has approximately N/2P records. This 
step takes N units of time. 


(2) Every one of the P processors is 
assigned 4 tapes: two of them are loaded, 
with N/2P records on each, and two are 
empty. Each processor performs the well-known 
algorithm of tape-sorting using the 
4 tapes it controls. (For a few more details 
see the Appendix.) This step takes 

\[ \frac{N}{P} \log_2 \frac{N}{P} \]

units of time. 

† Visiting at the Department of Computer 
Science, Cornell University, Ithaca, New 
York, summer 1973. 

(3) We now have P tapes which are each 
loaded with a group of N/P records and 
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the records on each of these tapes are 
sorted. We perform $\log_2 P$ phases of 
sort through merge. In the first phase 
every two groups are sort-merged into a 
group of 2N/P records. In the i-th 
phase every two groups of 

\[ \frac{2^{i-1} N}{P} \]

records are sort-merged into one group of 

\[ \frac{2^{i} N}{P} \]

records, etc. The whole process takes 

\[ \sum_{i=1}^{\log_2 P} \frac{2^{i} N}{P} = 2N - \frac{2N}{P} \]

units of time. We conclude that the time 
method A takes is 

\[ \frac{N}{P} \log_2 N - \frac{N}{P} (\log_2 P + 2) + 3N . \tag{1} \]

The method can be used even with a 
very large P. In the extreme case when 
P = N/4 the sort time reduces to 3N. 
However, the more practical cases are 
when $P < \log_2 N$, when (1) is well approximated by 

\[ \frac{N}{P} \log_2 N + 3N . \tag{2} \]

Except for the 3N term this 
achieves the best possible saving; namely, 
the best sorting time for one processor, 
which is $N \log_2 N$, is divided by the number 
of processors. 
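
As a quick numerical check of the step counts for Method A, the following Python sketch adds up the three steps (distribution, local four-tape sort, parallel merge phases) and compares the total with the closed form (1). The function names are hypothetical, and the sketch assumes that N and P are both powers of 2 so that every logarithm is an integer.

```python
from math import log2


def method_a_time(n, p):
    """Sum of the three steps of Method A (n and p assumed to be powers of 2)."""
    distribute = n                                   # step (1): spread N records over 2P tapes
    local_sort = (n / p) * log2(n / p)               # step (2): each processor sorts N/P records
    phases = p.bit_length() - 1                      # log2(p) merge phases in step (3)
    merge = sum((2 ** i) * n / p for i in range(1, phases + 1))
    return distribute + local_sort + merge


def method_a_closed_form(n, p):
    """Closed form (1): (N/P) log2 N - (N/P)(log2 P + 2) + 3N."""
    return (n / p) * log2(n) - (n / p) * (log2(p) + 2) + 3 * n


def method_a_approximation(n, p):
    """Approximation (2): (N/P) log2 N + 3N."""
    return (n / p) * log2(n) + 3 * n
```

For example, with N = 1024 and P = 4 both functions give 4608 time units, and with P = N/4 the closed form collapses to 3N, as stated in the text.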
III. Method B 
For simplicity, let us assume first 
that N is a power of 2 and that the 
number of processors available is 

\[ P = \log_2 N + 1 . \]

We shall use 4P tapes (in addition to 
the input tape). As we shall see later, 
the number of processors can be reduced 
to $\log_2 N$ and the number of tapes to 
$4(\log_2 N - 1)$. 
The tapes are divided into quadruples: 

\[ T_1^i ,\; T_2^i ,\; T_3^i ,\; T_4^i \]
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for i = 1,2,...,P.:. Time is measured in 
the unit of time necessary for reading 
and writing one record. The processors 
are denoted by $\Pi_1, \Pi_2, \ldots, \Pi_P$. 

During time $1 \le t \le N$, $\Pi_1$ reads the 
input tape and writes the records on 
$T_1^1$, $T_2^1$, $T_3^1$ and $T_4^1$ 
according to the following rule: 
(i) if $t \equiv 1 \pmod 4$, $\Pi_1$ writes on $T_1^1$; 
(ii) if $t \equiv 2 \pmod 4$, $\Pi_1$ writes on $T_2^1$; 
(iii) if $t \equiv 3 \pmod 4$, $\Pi_1$ writes on $T_3^1$; 
and 
(iv) if $t \equiv 0 \pmod 4$, $\Pi_1$ writes on $T_4^1$. 

$\Pi_k$ is active during 

\[ 2^k - 1 \le t \le N + 2^k - 2 . \]

For k = 2,3,...,P its activity is as follows: 
It reads from tapes of the (k-1)st 
quadruple and writes on tapes of the k-th 
quadruple. It performs, repeatedly, a 
sort-merge of two sorted lists of length 
$2^{k-2}$ into one sorted list of length 
$2^{k-1}$. 
The tapes are used according to the following 
rule: (a) 
(i) if 

\[ \left\lceil \frac{t - 2^k + 2}{2^{k-1}} \right\rceil \equiv 1 \pmod 4 \]

then $\Pi_k$ reads from 
$T_1^{k-1}$ and $T_2^{k-1}$ 
and writes on 
$T_1^k$; 
(ii) if 

\[ \left\lceil \frac{t - 2^k + 2}{2^{k-1}} \right\rceil \equiv 2 \pmod 4 \]

then $\Pi_k$ reads from 
$T_3^{k-1}$ and $T_4^{k-1}$ 
and writes on 
$T_2^k$; 

(a) Let $\lceil x \rceil$ denote the least integer 
which is not less than x, 
i.e. $\lceil 3.05 \rceil = \lceil 4 \rceil = 4$. 
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(iii) if 
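
\[ \left\lceil \frac{t - 2^k + 2}{2^{k-1}} \right\rceil \equiv 3 \pmod 4 \]

then $\Pi_k$ reads from 
$T_1^{k-1}$ and $T_2^{k-1}$ 
and writes on 
$T_3^k$; and 
(iv) if 

\[ \left\lceil \frac{t - 2^k + 2}{2^{k-1}} \right\rceil \equiv 0 \pmod 4 \]

then $\Pi_k$ reads from 
$T_3^{k-1}$ and $T_4^{k-1}$ 
and writes on 
$T_4^k$. 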
An example of N = 8 is shown in 
the diagram on the next page. Successive 
rows represent successive time. A solid 
line in the column $T_i^k$ in row t 
means that $\Pi_k$ is writing on $T_i^k$ 
during time t; a broken line means that 
$\Pi_{k+1}$ may be reading from it. 
The reader may establish for himself 
the validity of the following claims: 

(1) Every tape is emptied (the records 
it has contained are read) before it is 
loaded again with a sorted list. Thus, a 
tape of the k-th quadruple never contains 
more than $2^{k-1}$ records. 

(2) The reason for the difference in the 
starting times is that $\Pi_k$ is waiting 
for $\Pi_{k-1}$ to load $T_1^{k-1}$ and $T_2^{k-1}$. 
This takes $2^{k-1}$ units of time. Thus the 
starting time is 

\[ 1 + \sum_{j=1}^{k-1} 2^{j} , \]

namely, $2^k - 1$. 
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nN 


is first loaded during 


Thus, 


5 

aN 
For N > 4 this is larger than N 

Since ‘ie is not used after t =N , we 
can use the same tape for both tasks. 


Gy aes 


t = 1. 


is first loaded during 


t= 2P-l _ 4 4 2P-2 


Figure 
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t 1 
Thus, 
- 2 
t = aN 1 
For N > 2 this is larger than N. 
Since Tj; is not used after t = N , we 
can use the same tape for both tasks. 
(5) To and i. are never used. 
(6) 7 is first loaded during 
Cao 1 Ow Tr 
For N > 3 this is larger than N + 2 
Since i is not used after t = N + 2 


we can use the same tape for both tasks. 
(7) $T_2^P$, $T_3^P$ and $T_4^P$ are never used. 

(8) Claims (3) to (7) imply that only 
$4(\log_2 N - 1)$ tapes are necessary. 

(9) $\Pi_1$ and $\Pi_P$ can be the same. 
The reasons are as in (6). Thus, only 
$\log_2 N$ processors are required. 

(10) During t = N there are $\log_2 N$ 
processors in action and $4(\log_2 N - 1)$ 
tapes are occupied. Thus, no further 
saving is possible unless basic changes 
are made in the procedure. 

The whole process takes 3N-2 units 
of time, and $T_1^P$ 
is the output tape. This compares favorably 
with Method A (see (2)) which requires 
approximately 4N units of time 
in case $P = \log_2 N$. In my opinion, 
Method B is more elegant and easier to 
implement. 

Let us now discuss the case 
$P \ne \log_2 N$. (The reader should notice 
that we have started with $P = \log_2 N + 1$ 
but have reduced the number of processors 
to $\log_2 N$.) 
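
The timing claims for Method B can be checked with a short Python sketch. It assumes, as the claims above indicate, that each processor handles N records in N consecutive time units and that processor k+1 starts $2^k$ time units after processor k (the time processor k needs to load two lists of $2^{k-1}$ records). The function names are hypothetical and N is assumed to be a power of 2.

```python
def method_b_windows(n):
    """Map each processor index k to its (first, last) active time unit.

    Assumes n is a power of 2 and P = log2(n) + 1 processors.
    """
    p = n.bit_length()            # equals log2(n) + 1 when n is a power of 2
    windows = {}
    start = 1
    for k in range(1, p + 1):
        windows[k] = (start, start + n - 1)
        start += 2 ** k           # processor k+1 waits for two lists of 2**(k-1) records
    return windows


def active_at(windows, t):
    """Processors whose activity window contains time t."""
    return [k for k, (first, last) in windows.items() if first <= t <= last]
```

For N = 8 the last processor finishes at time 3N - 2 = 22, exactly $\log_2 N = 3$ processors are active at t = N, and the windows of the first and last processors do not overlap, which is what allows them to be the same physical processor.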

Method B is not suitable for using 
much more than $\log_2 N$ processors. 
Clearly, when $\log_2 N$ is not integral we 
can use $\lceil \log_2 N \rceil$ 
processors and pretend we have 
$2^{\lceil \log_2 N \rceil}$ 
records by filling in "dummy records". 
(Some improvements on this are possible 
but essentially the processing time is 

\[ 3 \cdot 2^{\lceil \log_2 N \rceil} - 2 . \]

A similar problem occurs in Method A if 
P is not a power of 2.) However, there 
is no way to use more than 
$\lceil \log_2 N \rceil$ processors 
without changing the method considerably. 
More interesting is the case when 
$P < \log_2 N$. For simplicity, let us discuss 
the case of 

\[ N = 2^{(P-1)Q} \]

where both P and Q are positive integers. Let 

\[ M = 2^{P-1} . \]

The computation is done in Q 
passes. In each pass the output tape 
contains output lists which are M times 
longer than before. The number of processors 
used is P and the number of 
tapes is 4P - 2. (Ignore here the savings 
discussed in Claims (3) to (10); 
only $T_3^P$ and $T_4^P$ 
are not needed.) 

In the first pass we use the same 
procedure as discussed before, except 
that after $T_1^P$ 
is loaded with a sorted list of length M 
another sorted list is loaded next to it, 
etc. This continues until all N records 
are on $T_1^P$ 
in sorted lists of length M. 
In the second pass $T_1^P$ 
is used as the input tape and $T_2^P$ 
as the output tape. The length of the 
list on $T_i^k$, $1 \le k < P$, is $M \cdot 2^{k-1}$ 
and the timing is now in multiples of M. 
The lists on $T_2^P$ 
are now of length $M^2$. 

After Q passes the sorting is complete. 
The i-th pass takes 

\[ N + (2^P - 2) M^{i-1} \]

units of time. Thus, the total time is 

\[ QN + (2^P - 2) \sum_{i=1}^{Q} M^{i-1} = QN + (2^P - 2) \frac{M^Q - 1}{M - 1} = QN + 2(M^Q - 1) = \frac{N \log_2 N}{P - 1} + 2(N - 1) \tag{4} \]

which is similar to (2). 

The method can be improved by start- 
ing the next pass before the present one 
is over. However, this will only reduce 
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the second term of (4). 
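
The multi-pass total can be verified numerically. The Python sketch below sums the per-pass times and compares the result with the closed form (4); the function names are hypothetical, and it assumes $M = 2^{P-1}$ and $N = M^Q$ as in the derivation above.

```python
def multipass_total(p, q):
    """Return (N, total time) by summing the Q passes of Method B with P processors."""
    m = 2 ** (p - 1)              # growth factor of the sorted lists per pass
    n = m ** q                    # number of records, N = 2**((P-1)*Q)
    per_pass = [n + (2 ** p - 2) * m ** (i - 1) for i in range(1, q + 1)]
    return n, sum(per_pass)


def multipass_closed_form(p, q):
    """Closed form (4): Q*N + 2*(N - 1), where Q*N = N*log2(N)/(P - 1)."""
    n = 2 ** ((p - 1) * q)
    return q * n + 2 * (n - 1)
```

With P = 3 and Q = 2 (so M = 4 and N = 16) the two passes take 22 and 40 time units, and both functions give 62.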


Appendix: Tape-Sorting by One Processor 
and Four Tapes 


This well-known and widely used algorithm 
runs as follows: Assume the records 
are stored on two tapes, $T_1$ and 
$T_2$, each containing n/2 records, while 
the other two tapes, $T_3$ and $T_4$, are 
empty. Also, assume the data on $T_1$ and 
$T_2$ is already partially sorted in the 
following way: The n/2 records are divided 
into groups of $2^i$ 
records. Each of these groups is already 
sorted, say from low to high, and the 
groups are stored consecutively on the 
tape. Thus, the number of groups on each 
tape is 

\[ \frac{n}{2^{i+1}} . \]

Initially i = 0. For simplicity, let 
us assume that $n = 2^{\ell}$. 

The algorithm goes through $\ell$ Phases. 
In Phase 1 we read the first record 
from each input tape ($T_1$ and $T_2$) and 
store both on $T_3$ in increasing order. 
Next we read the second record from each 
input tape and store both on $T_4$ in increasing 
order. Next we return to load a 
group of two on $T_3$, etc. After n 
units of time (since each record is read 
once and is written once) all the records 
are distributed to $T_3$ and $T_4$ in ordered 
groups of size 2 ($= 2^1$). 
In Phase i, $i < \ell$, we perform a 
merge of the two groups which are presently 
on top of the two input tapes and store 
the merged double size group on one of the 
output tapes, alternately. (If i is 
odd then $T_1$ and $T_2$ are the input tapes 
and $T_3$ and $T_4$ are the output tapes; 
if i is even, tasks are reversed.) The 
merging of these two groups is achieved by 
reading the top record from each group and 
writing the smaller one on the output 
tape. After each such writing, the top 
record from the same group as the one 
which has just been written is read and 
compared with the record still in memory, 
etc. This is continued until one of the 
groups is exhausted; the remainder of the 
other group is directly transferred to the 
output tape. 

We continue merging groups of size 
$2^{i-1}$, 
one from each input tape, into groups of 
size $2^i$ 
which are stored on the output tapes, 
changing the output tape after each 
group. 

In Phase $\ell$, there is one sorted 
group of size n/2 
on each input tape and they are merged 
into one sorted group of size n which 
is stored on one of the output tapes. 

The whole operation takes $n \log_2 n$ 
units of time. 
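
The four-tape procedure of the Appendix can be sketched in Python, with each tape modelled as a queue (`collections.deque`). This is an illustration under stated assumptions rather than a definitive implementation: the function name `four_tape_sort` is hypothetical, the number of records must be a power of 2, and one time unit is counted for each record written (the text counts one unit per record read and written).

```python
from collections import deque


def four_tape_sort(records):
    """Balanced two-way merge on four 'tapes'; returns (sorted list, time units used)."""
    n = len(records)
    assert n >= 2 and n & (n - 1) == 0, "n must be a power of 2"
    inputs = [deque(records[: n // 2]), deque(records[n // 2:])]
    outputs = [deque(), deque()]
    group, time = 1, 0                      # current sorted-group size, elapsed time
    while group < n:
        target = 0                          # output tape receiving the next merged group
        while inputs[0]:
            a = [inputs[0].popleft() for _ in range(group)]
            b = [inputs[1].popleft() for _ in range(group)]
            i = j = 0
            while i < group or j < group:
                # Take from a when b is exhausted, or when a's head is not larger.
                take_a = j == group or (i < group and a[i] <= b[j])
                if take_a:
                    outputs[target].append(a[i])
                    i += 1
                else:
                    outputs[target].append(b[j])
                    j += 1
                time += 1
            target ^= 1                     # change the output tape after each group
        inputs, outputs = outputs, [deque(), deque()]   # tapes swap roles for the next phase
        group *= 2
    return list(inputs[0]), time
```

For 8 records the sketch performs 3 phases of 8 time units each, that is $n \log_2 n = 24$ units.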
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A PARALLEL ALGORITHM FOR MAXIMUM FLOW PROBLEM 


Yu K. Chen and Tse-yun Feng 
Department of Electrical and Computer Engineering 
Syracuse University 
Syracuse, New York 13210 


Summary 


This algorithm is developed for solving the 
maximum flow problem in an associative processor. 
It is based upon the matrix multiplication 
approach [1] for finding the flow route. Given 
a capacity matrix $C^1 = [c^1_{ij}]$, with $c^1_{ij}$ to be the 
capacity from node i to node j and $c^1_{ii} = \infty$ for 
all values of i, and $C^1_1$ to be the first row of 
$C^1$, one can generate $C^2_1, C^3_1, \ldots$ to $C^m_1$ successively 
by the matrix multiplication $C^m_1 = C^{m-1}_1 \times C^1$, where 
the ordinary matrix product is used with the 
following modifications: (1) $c_{ik} \cdot c_{kj} = \min(c_{ik}, c_{kj})$ 
and (2) $\sum_k c_{ik} c_{kj} = \max_k (c_{ik} c_{kj})$. Under 
the new definitions of matrix multiplication, 
$c^m_{1i}$, the i-th element of $C^m_1$, clearly represents the 
maximum flow between the source and the node i by 
means of paths which have m branches or less. (Node 1 
is assumed to be the source node and node 
n to be the sink node.) The 
multiplication process stops either when $c^m_{1n} \ne 0$ 
or when m = n - 1. Unlike the previously 
proposed sequential labeling methods [2] - [3], 
in which the trace of the path has to be carried out 
along with the labeling process, the construction 
of a trace matrix $T = [t_{ij}]$ proposed here can be 
performed at the conclusion of the matrix 
multiplication. Matrix T is a zero-one matrix: 
$t_{ij} = 1$ if $c^m_{1j} = c^{m-1}_{1i} \cdot c^1_{ij} > 0$ and $i \ne j$; 
otherwise, $t_{ij} = 0$. Matrix T contains one or more 
paths. To select a single path matrix $P = [p_{ij}]$, a 
backward trace technique can be used. The 
algorithm is designed to fully utilize the word-parallel 
and the fast search-retrieval capabilities 
of the associative processor to gain 
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execution speed. A few transpose operations are 
required in this algorithm. Therefore, a data 
manipulator [4] that provides the transpose function 
would certainly help the execution 
speed. The multi-terminal network flow [5] - [6] 
is not considered. This algorithm has been coded 
in APL to emulate its execution in an associative 
processor [7]. Results are compared with the 
algorithm proposed in [8]. An approximately two-to-one 
improvement in execution time is 
indicated. 
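
The modified matrix product described in the summary (minimum in place of multiplication, maximum in place of summation) can be sketched sequentially in Python. This is only an illustration of the row update and the stopping rule, not the associative-processor algorithm or the trace-matrix construction; the function names and the small example network are hypothetical, and nodes are numbered from 0, so node 0 is the source and node n-1 the sink.

```python
INF = float("inf")


def max_min_row_step(row, cap):
    """One modified product: new_row[j] = max over k of min(row[k], cap[k][j])."""
    n = len(cap)
    return [max(min(row[k], cap[k][j]) for k in range(n)) for j in range(n)]


def source_row_until_sink(cap):
    """Repeat the product until the sink entry is nonzero or m = n - 1.

    Returns (row, m), where row[i] is the largest bottleneck capacity from the
    source to node i over paths with at most m branches.
    """
    n = len(cap)
    row = list(cap[0])            # first row of the capacity matrix
    m = 1
    while row[n - 1] == 0 and m < n - 1:
        row = max_min_row_step(row, cap)
        m += 1
    return row, m


# Hypothetical 4-node network: 0->1 (3), 1->2 (2), 2->3 (5), 0->2 (1); diagonal is infinite.
example_cap = [
    [INF, 3, 1, 0],
    [0, INF, 2, 0],
    [0, 0, INF, 5],
    [0, 0, 0, INF],
]
```

In the example the process stops at m = 2 with a route of capacity 1 to the sink (0, 2, 3); one further product would raise the sink entry to 2 through the three-branch route 0, 1, 2, 3.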
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PARALLEL - SEQUENTIAL PROCESSING OF FINITE PATTERNS* 


William I. Grosky and Frank Tsui 
School of Information and Computer Science 
Georgia Institute of Technology 
Atlanta, Georgia 3034] 


Abstract -- The various basic conditions 
under which parallel, sequential and mixed par- 
allel-sequential processing in tessellation 
structures, using the same local transformations, 
are equivalent in terms of pattern generation 
are studied. Various necessary conditions and 
sufficient conditions for equivalence are derived. 
We then illustrate a 'mutually destructive’ con- 
dition where sequential and parallel processing 
cannot be made equivalent, and study this condi- 
tion further. We finally relax some of our hypoth- 
eses and countenance the notion of simulation 
between parallel, sequential and mixed parallel- 
sequential processing, giving sufficient condi- 
tions for such simulations to exist. 


Several recent research efforts are aimed at 
strengthening the theoretical understanding of 
parallel and sequential modes of picture and 
pattern processing [1, 2, 4, 5, 6, 9]. Rosenfeld 
and Pfaltz [7] have shown that any picture 
transformation that can be accomplished by a 
series of parallel local operations with Moore 
neighborhood index can also be accomplished by a 
series of sequential local operations with Moore 
neighborhood index, and conversely; but the local 
operations may be different for the two types of 
processing. 

In this paper, we first concentrate our investigation 
on the equivalence of parallel, sequential 
and mixed parallel-sequential local operations 
of arbitrary neighborhood index in arbitrary 
dimensions, where the local operator is the 
same for each mode of processing. We then relax 
this latter condition and explore the notion of 
simulation in general. The methodology used in 
this work is that of tessellation automata. We 
have generalized previously formulated definitions 
of these entities to take into account sequential 
processing, and we call our new entities the class 
of stratified mixed mode tessellation automata. 

* This work was supported in part by NSF Grant 
GN-655 


Stratified Mixed Mode Tessellation Automata 


DEFINITION 1: A stratified mixed mode tessellation automaton, TA, is a 
4-tuple $\langle S, Z^n, NI, GT \rangle$, where, 
1) S is a finite, non-empty set of states. 
2) $Z^n$ is the set of n-tuples of integers. For 
$\zeta \in Z^n$, we call $\zeta$ a cell of TA. The set 
$CON = \{ g \mid g: Z^n \to S \}$ is called the set of configurations 
of TA. 
3) NI is an ordered q-tuple of elements of $Z^n$, 
for some $q \ge 1$, and is called the neighborhood 
index of TA. Suppose $NI = \langle \gamma_1, \ldots, \gamma_q \rangle$. Then, for 
$\zeta \in Z^n$, $Ne(\zeta) = \{ \zeta + \gamma_1, \ldots, \zeta + \gamma_q \}$ is called the 
neighborhood of $\zeta$. 
4) $GT \ne \emptyset$, called the set of global transformations, 
is a finite subset of $CON^{CON}$ which is 
the union of $GT_p$, $GT_s$, $GT_{p,s}$ and $GT_{s,p}$, the sets 
of parallel, sequential, parallel-sequential and 
sequential-parallel global array transformations, 
defined as follows, 

a) Suppose $\rho \in GT_p$ and let $c \in CON$. Then 
$\rho(c) = c' \in CON$, where, for some $\sigma_\rho: S^q \to S$, 
called a local transformation, we have, for each 
$\zeta \in Z^n$, that $c'(\zeta) = \sigma_\rho(c(\zeta + \gamma_1), \ldots, c(\zeta + \gamma_q))$. 
c' is called the successor configuration of c 
with respect to $\rho$. Thus, the state of a particular 
cell in a successor configuration of c depends 
on the states of the neighborhood of that 
cell in configuration c. 


b) Suppose p e GT. eT, s¥ GT. 


o(c) = c' e CON, where, for some oS > § and 
ny j n ee . 
TE 2 4 )J uv (2")°, for T injective, called 


J 
a trajectory, we now define c'(z) for Z « Zs 


- then 


i) Suppose p e GT,. Then we require t, to 
be surjective as well as injective. For 
O<ic< te), define te) e¢ CON as follows, 


oe) = 


= 1 {k) 
)(z) otherwise = 


(z) 
7 (x) 


ii) Suppose po ¢ CT, 5° Then we require T, 


Then, c'(z) = c (x) 


3 
to be non-surjective. Define c* e CON by 


{o_(clery,),...,cléty_)) if 
c*(z) 3 a Chey : 2Yq zd range(t ,) 


tele) otherwise 
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Then, 
ee if c¢ range(T, ) 

c'(2)> ¢ 4G) -1,., ©) if 5 © range(t,) 
t (ol range(e, )? () 


iti) Suppose p € GT. + Then we require t 


9 p 
to be non-surjective. Define c** € CON by, 


(el. ) 4 (6) if ee range (T ) 
okk(E) = aS o| range(t,)) €) 

tele) if & ¥ range(t,) 
Then, 

{exe(5) if 5 © range (t,) 
c'(Z) 


= { 
O (owe (C+Y. ‘Jaeee se (CY )) if 
tp “4 ce range (tT, ) 


In the above three cases, the trajectory 
indicates the sequential order in which the cells 
of $Z^n$ are processed. Case i) is pure sequential 
processing in which the state of a particular cell 
in a successor configuration of c is determined 
by the states of the neighborhood of that cell in 
configuration c#, where c# differs from c only in 
that we update the states of all cells processed 
before the given one. Case ii) results when some 
cells are first processed in parallel and then the 
remaining cells are processed sequentially. Case 
iii) results when some cells are first processed 
sequentially and then the remaining cells are 
processed in parallel. 


Suppose $\sigma: S^q \to S$. We define $par(\sigma) \in GT_p$ to be 
the parallel global array transformation determined 
by $\sigma$, $seq_\tau(\sigma) \in GT_s$ to be the sequential 
global array transformation determined by $\sigma$ and 
the surjective trajectory $\tau$, $par\text{-}seq_\tau(\sigma) \in GT_{p,s}$ 
to be the parallel-sequential global array transformation 
determined by $\sigma$ and the non-surjective 
trajectory $\tau$, and $seq\text{-}par_\tau(\sigma) \in GT_{s,p}$ to be the 
sequential-parallel global array transformation 
determined by $\sigma$ and the non-surjective trajectory 
$\tau$. 

We now define various concepts which will 
prove to be useful in the balance of this paper. 


DEFINITION 2: For $\sigma: S^q \to S$, if $\sigma$ is independent 
of its j-th argument, for $1 \le j \le q$, we call cell 
$\zeta + \gamma_j$ an independent neighbor of $\zeta$ with respect to 
$\sigma$, for each $\zeta \in Z^n$. 


DEFINITION 3: For $\zeta \in Z^n$ and $\tau$ a trajectory, we 
define the preprocessed set of $\zeta$ with respect to 
$\tau$ for pure sequential, parallel-sequential and 
sequential-parallel processing, 

a) Let $\tau$ be a surjective trajectory. Then the 
set $\{ \tau(0), \ldots, \tau(\tau^{-1}(\zeta) - 1) \} \cap Ne(\zeta)$ is called the 
preprocessed set of $\zeta$ with respect to $\tau$ for pure 
sequential processing. 

b) Let $\tau$ be a non-surjective trajectory, 

i) Suppose $\zeta \in range(\tau)$. Then, the set 
$( \{ \tau(0), \ldots, \tau((\tau|_{range(\tau)})^{-1}(\zeta) - 1) \} \cup (Z^n - range(\tau)) ) \cap Ne(\zeta)$ 
is called the preprocessed set of $\zeta$ with respect to $\tau$ for 
parallel-sequential processing. 

ii) Suppose $\zeta \notin range(\tau)$. Then the preprocessed 
set of $\zeta$ with respect to $\tau$ for parallel-sequential 
processing is $\emptyset$. 

c) Let $\tau$ be a non-surjective trajectory. 

i) Suppose $\zeta \in range(\tau)$. Then the set 
$\{ \tau(0), \ldots, \tau((\tau|_{range(\tau)})^{-1}(\zeta) - 1) \} \cap Ne(\zeta)$ is 
called the preprocessed set of $\zeta$ with respect to 
$\tau$ for sequential-parallel processing. 

ii) Suppose $\zeta \notin range(\tau)$. Then the preprocessed 
set of $\zeta$ with respect to $\tau$ for sequential-parallel 
processing is $range(\tau) \cap Ne(\zeta)$. 


Briefly, the preprocessed set of $\zeta$ with 
respect to $\tau$ in the various methods of processing 
is just the collection of neighbors of $\zeta$ which 
were processed before $\zeta$. 


DEFINITION 4: We call a local transformation 
$\sigma: S^q \to S$ surjective of degree k/q, for $0 < k \le q$, 
if, by varying the values of any k arguments of 
$\sigma$, we can produce as output every element of S, 
regardless of the values of the other q-k arguments. 
That is, for each $1 \le i_1 < i_2 < \ldots < i_k \le q$, 
the function $\sigma\#: S^k \to S$ defined by 
$\sigma\#(x_{i_1}, \ldots, x_{i_k}) = \sigma(y_1, \ldots, y_q)$, where, for 
$1 \le j \le q$, $y_j = x_j$ if $j \in \{ i_1, \ldots, i_k \}$, while 
$y_j \in S$ if $j \notin \{ i_1, \ldots, i_k \}$, is surjective. 
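
Definition 4 can be checked by brute force for small state sets. The Python sketch below is an illustration under stated assumptions: the function name `surjective_of_degree` is hypothetical, the local transformation is passed as an ordinary Python function of q arguments, and the check enumerates every choice of k argument positions and every assignment of the remaining q-k arguments.

```python
from itertools import combinations, product


def surjective_of_degree(sigma, states, q, k):
    """True if varying any k of the q arguments of sigma always reaches every state."""
    all_states = set(states)
    for free in combinations(range(q), k):              # positions i_1 < ... < i_k
        fixed = [j for j in range(q) if j not in free]
        for fixed_vals in product(states, repeat=len(fixed)):
            reached = set()
            for free_vals in product(states, repeat=k):
                args = [None] * q
                for j, v in zip(free, free_vals):
                    args[j] = v
                for j, v in zip(fixed, fixed_vals):
                    args[j] = v
                reached.add(sigma(*args))
            if reached != all_states:
                return False
    return True


def add_mod3(a, b):
    """Hypothetical example: addition modulo 3 on states {0, 1, 2}."""
    return (a + b) % 3


def second_argument(a, b):
    """Hypothetical example: ignores its first argument."""
    return b
```

Addition modulo 3 is surjective of degree 1/2, since either argument alone can steer the output to any state, whereas a transformation that ignores its first argument is surjective of degree 2/2 but not of degree 1/2.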


DEFINITION 5: Cell ¢ is said to be a related 
neighbor of cell € if there exists a chain of 


6. =&, and, for 1 <i <¢ ml, dye Ne(S..))- 


DEFINITION 6: Suppose p ec G p(G, ) (6, 5) (G, ar 


The seed set of p, SS(p), is that et Se focal 
transformations o such that p = par(c) (p = 
seq_ (9) for some tT) (p = par-seq, (c) for some 


t) (p = seq-par,_(¢) for some rt) 


Various Notions of Simulation 


In this section we examine numerous notions 
of the simulation of one tessellation automaton 
by another. One general definition of simulation 
which we will use is that of A.R. Smith [ 8]: 


DEFINITION 7: Let TA = <S,Z",NI,GT> and TA* = 


<S*,Z" ,NI*,GT*> be two n-dimensional stratified 
mixed mode tessellation automata. For t,r 2 I, 

we say that TA* simulates T in t/r times real 
time, if there are effectively computable Injec- 
tive mappings A:CON + CON* and [:GIT > CIs’, such 
that, for any c ce CON and <PyseessP7 € GT”, we 
ave: 
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A(L (2-00 p(c))--6)) =o8(.--(0FA(c))).-.), where 
<0 Tp 200 PP =I (<0 1,600 >). Iift=re= 
that TA* simulates TA in real time. 


1, we say 


The type of simulation we examine first is 
called strict simulation, 


DEFINITION 8: Let TA = <S,Z",NI,GT> and TA* = <S%, 


2” NI, GT! be two n-dimensional stratified mixed 
mode tessellation automata which are such that 
S* = S and NI* = NI. We say that TA* strictly 
simulates TA if, in Definition 6, t=rz=1,A 

is the identity map, and, forp e GT, 

$S(o) a SS(T(0)) # de 


In exploring this notion of strict simulation, 
we are really just trying to determine when 
global transformations of the form $par(\sigma)$, 
$seq_{\tau_1}(\sigma)$, $par\text{-}seq_{\tau_2}(\sigma)$ and $seq\text{-}par_{\tau_3}(\sigma)$ are equal 
for various $\sigma$, $\tau_1$, $\tau_2$, $\tau_3$. We thus concentrate 
on the latter formulation. 
Our first result, being fairly obvious, is 
presented without proof, 


THEOREM 1: For $\sigma: S^q \to S$ and $\tau_1$, $\tau_2$, $\tau_3$ trajectories 
of the appropriate type, a sufficient condition 
for $\{ par(\sigma), seq_{\tau_1}(\sigma), par\text{-}seq_{\tau_2}(\sigma), seq\text{-}par_{\tau_3}(\sigma) \}$ 
to be pairwise equal is that for 
each cell $\zeta \in Z^n$, 
the preprocessed set of $\zeta$ with 
respect to $\tau_1$ ($\tau_2$) ($\tau_3$) for sequential (parallel-sequential) 
(sequential-parallel) processing consist 
entirely of independent neighbors of $\zeta$ with 
respect to $\sigma$. 


The converse does not hold as the following 
example demonstrates, 


Let n = 1, NI = <-1,0>, S = {1,2,3}, and 
$\sigma(1,1) = \sigma(2,1) = \sigma(3,1) = \sigma(3,2) = 1$, $\sigma(1,3) = 
\sigma(2,3) = \sigma(3,3) = 3$, and $\sigma(1,2) = \sigma(2,2) = 2$. 


It is easily verified that $par(\sigma) = seq_{\tau_1}(\sigma) = par\text{-}seq_{\tau_2}(\sigma) = seq\text{-}par_{\tau_3}(\sigma)$ for all 
appropriate trajectories, while $\sigma$ is neither 
independent of its first nor second argument. 
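
The example can be checked mechanically on a small scale. The Python sketch below is not the paper's construction: it replaces the infinite cell space by a finite cyclic array (an assumption made so that every configuration and every processing order can be enumerated), compares only the parallel and the pure sequential modes, and uses hypothetical function names. The local transformation `sigma` encodes the table of the example: it returns its second argument except that the pair (3, 2) is mapped to 1.

```python
from itertools import permutations, product

NI = (-1, 0)   # neighborhood index of the example: left neighbor, then the cell itself


def sigma(a, b):
    """Local transformation of the example."""
    return 1 if (a, b) == (3, 2) else b


def shift(a, b):
    """Hypothetical contrast case: copy the left neighbor's state."""
    return a


def par(local, config):
    """Parallel mode: every cell reads the old configuration."""
    n = len(config)
    return tuple(local(*(config[(z + g) % n] for g in NI)) for z in range(n))


def seq(local, config, trajectory):
    """Sequential mode: cells are updated in place in trajectory order."""
    n = len(config)
    cells = list(config)
    for z in trajectory:
        cells[z] = local(*(cells[(z + g) % n] for g in NI))
    return tuple(cells)


def modes_agree(local, states, size):
    """True if par and seq coincide for all configurations and all orders."""
    return all(
        par(local, c) == seq(local, c, t)
        for c in product(states, repeat=size)
        for t in permutations(range(size))
    )
```

On a ring of 4 cells the two modes agree for `sigma` under every processing order, even though `sigma` depends on both arguments, while they disagree for `shift`.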
We do have the following, though, 
THEOREM 2: For o:$7 +S and T1s Tos T3 trajector- 
ies of the appropriate type, suppose that {par(o), 
seq, (co), par- seq, (co), seq- par, (c)} ‘are pair- 
2 


wise equal. Letting SET. be the peo cnee set 
< 3, 


suppose that for each € « SET, , o is surjective 


of cwith respect to Tes for some l < 


of degree a,/q, where, 


& 


= q-|Ne(E) A Ne(z)[-[Ne(ze) a UY 


Ne(6) |. 
& SeSET, ~{E} 
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Then, each element of SET, 
neighbor of &. 


PROOF OF THEOREM 2: It is easily seen that if 
Ot > 0, then there are at least a, cells in the 


is an independent 


neighborhood of € - not including — - which are 
neither in the neighborhood of ¢ nor in the 
neighborhood of any other cell in SET, . Since 
o is surjective of degree a./q, by varying the 
states in these Op cells, we can force cell — to 
be in any state in S before we process cell f. 


Consider any initial configuration c. Suppose the 
next state of cell z is o, if c is processed in 


a parallel mode. Thus, the next state of cell 4 
is aE if c is processed in any of the other three 


modes, regardless of the states of the cells in 
SET.. Since each cell in SET. can be put in any 


state, our result follows. 
QED 


»[Ne()] > 1. 


COROLLARY I: Let ¢ © Z" and ey 
- {c¥, the local 


Suppose that for each & e« Ne(t 


‘transformation o is surjective of degree #19 > 0 


where a, = q = |Ne(E) a Ne(z)|. Then, there 


exist appropriate trajectories Ty» Ty, Tz such 


y 

that {par(o), seq. (c), par-seq. (c), 
ia tty 

seq-par,, (c)} are not pairwise equal. 

PROOF OF COROLLARY 1: Suppose that for all appro- 

priate trajectories Ty» To, Tz, we have that 


{par(c), as ee pareede 0 sae 


are pairwise equal. Choose trajectories tt, oe 
and 73 such that, for each 1 < i < q, there is a 


cell ¢; such that c.ty, is the only element in 


the preprocessed set of ¢. with respect to TT 


t>, and T3. By Theorem 2, o-4y., for l <iq, 


is an independent neighbor of %. with respect to 
o. Thus, either o. is a constant function or the 
next state of any cell depends just on its pre- 
vious state. Thus, o is not surjective of degree 
a-/q > 0. 
= QED 


From Theorem 1, we see that if trajectories 
Ty» T),T3 Can be found such that for each applic- 


ation of the local transformation o to a cell &, 
each cell in Ne(&) is not in the preprocessed set 
of — with respect to Ty» HsT3, OF, if it is, is 


either an independent neighbor of — with respect 

to o or is in a state which never changes, then 

par(o) = seq_ (co) = par-seq_ (c) = seq-par_ (c). 
Ty T T3 


We now look into the question of finding such 
trajectories. For this question, we restrict our 
attention to the finite configurations and a 
sub-set of them called the fixed configurations. 
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Our definition of strict simulation is likewise 
restricted to strict simulation over finite or 

fixed configurations by restricting the domain 

of A to these sub<sets. 


DEFINITION 9: A state se S is called quiescent 
with respect to o:Sl> S$ if o(Ssiut45) = 55. OA 
Reet fra” 


q 
configuration c is called finite with respect to 


o:Sl> S$ if only a finite number of cells of c 
are non-quiescent with respect to o. 


DEFINITION 10: A state $s \in S$ is called null with
respect to $\sigma : S^q \to S$ if $c(\zeta) = s \Rightarrow \mathrm{par}(\sigma)(c)(\zeta) = s$.
A configuration all of whose cells are in a
null state is of size 0. A configuration c is of
size $m \ge 1$ if, when the absolute value of any
coordinate of a cell $\zeta$ is larger than m, then
$c(\zeta)$ is null. A configuration is called fixed of
degree m if m is the least integer such that c is
of size m. The window set of a fixed configuration
of degree $m > 0$ is the set consisting of
those cells which are such that the absolute
value of each of their coordinates is less than
or equal to m. We denote it by $W_m$. The window
set of a configuration all of whose cells are in
a null state is $\emptyset$.


Of course, all configurations which are 
fixed of degree m are finite. 


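LEMMA 1: Let $R_n = \{\langle \zeta_1, \zeta_2 \rangle \mid \zeta_1, \zeta_2 \in Z^n$ and $\zeta_2$ is a
related neighbor of $\zeta_1\}$, and let $E_n = \{\langle \zeta, \zeta \rangle \mid \zeta \in Z^n\}$.
Then $R_n \cup E_n$ is a transitive, reflexive
relation.


THEOREM 3: Let $\sigma : S^q \to S$ and NI = $\langle \gamma_1, \ldots, \gamma_q \rangle$
for $\gamma_1, \ldots, \gamma_q \in Z^n$. A sufficient condition for
$\{\mathrm{par}(\sigma), \mathrm{seq}_{\tau_1}(\sigma), \text{par-seq}_{\tau_2}(\sigma), \text{seq-par}_{\tau_3}(\sigma)\}$
to be pairwise equal over fixed configurations of
degree $m > 0$ for some trajectories $\tau_1, \tau_2, \tau_3$ is
that no two different cells of $Z^n$ are related
neighbors of each other.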


PROOF OF THEOREM 3: We claim that $R_n \cup E_n$ is a
partial order over $Z^n$. From Lemma 1, $R_n \cup E_n$ is
transitive and reflexive. Suppose $\zeta_1 (R_n \cup E_n) \zeta_2$
and $\zeta_2 (R_n \cup E_n) \zeta_1$. By our hypotheses, it is
impossible to have $\zeta_1 R_n \zeta_2$ and $\zeta_2 R_n \zeta_1$. Thus,
we must have that $\zeta_1 = \zeta_2$. Thus, we have our
result. This, in turn, implies that $(R_n \cup E_n) \cap (W_m \times W_m)$
is a partial order over $W_m$. We then
embed $(R_n \cup E_n) \cap (W_m \times W_m)$ in a total order
$\preceq$. Thus, for all $\zeta_1, \zeta_2 \in W_m$, $\zeta_1 R_n \zeta_2 \Rightarrow \zeta_1 \preceq \zeta_2$.
nae: a such 


Let <x. > be a listing of W. 
that for l1<u <v < - Note that 


u Vv 


for l<u<v <a’, c. ¢ Ne(t, ). For, 
V 


suppose &. € Ne(c ). Thus, 5, is a related 


V uU V 
which 


neighbor of . and hence oa R, &. 


jj ? 
u V u 
ae 
Si 

V u 


implies that an » which, in turn, implies 


that Jos Jy which is a contradiction. 


Thus, for pure sequential processing, we 


can choose any trajectory Ty such that, If 
Kp Se ty) ey. ayo) Sey. @ for 
jy jy 


e e n ° e 
I< j,.j, < m, then j, < Jy. 

For parallel-sequential processing, we can 
choose any trajectory To which is such that 
either, 

1) the range of Tt, doesn't include any 


element &, for l<je< m’: or 


2) the range of T. includes each element 


of se for some 1 < j < m", is 


J 


disjoint from {G. a? and is such that 


porerSe 


_ m 
if kj < k, to (ky ) = » (ky ) = ce. 
a ‘Jy 
j, then Jo < Jy: 


, for 


ls jyodo 8 


For sequential-parallel processing we can 
choose any trajectory T3 which is such that 
either, 

1) the range of T3 doesn't include any 
element &, for l<j< ms or 
J 
2) the range of T3 includes each element 


of ey pee te } for some 1 < j < m', is dis- 
m 
joint — ty pese,t, } and is such that if 
j-l 
ki < ky, T 36k, - be T3 (ky) =o. , for 
Jy dg 


i < Jy2J5o < nm", then J, < jy. 


QED 
We now give a necessary and sufficient con- 
dition for two different cells of Z" to be 


related neighbors of each other; but first we 
show, 
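LEMMA 2: Suppose NI = $\langle \gamma_1, \ldots, \gamma_q \rangle$ for $\gamma_1, \ldots, \gamma_q$
members of $Z^n$. Then cell $\zeta$ is a related neighbor of
cell $\theta$, for $\zeta \ne \theta$, iff
$\zeta = \delta_{j_1}\gamma_{j_1} + \cdots + \delta_{j_p}\gamma_{j_p} + \theta$
for some $1 \le p \le q$, where $\delta_{j_k} > 0$ and $\gamma_{j_k} \ne 0$
for $1 \le k \le p$.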


PROOF OF LEMMA 2: Suppose gis a related neighbor 
of 6. Thus, there is a chain of cells eee 


such that &) =&> &, 
ls ig s-l, €; e Ne(E..,). Thus, &) = * XK? 


a Lege Een] - E* Yk? for l< Ka aeees 
K. <q. We then get that CS at Cece Ce 
2 3 


Pees OF are + E? where 


> 0 and Yj #0 forlsk<«p. 
k 


for s 3 2, 8, and, for 


Eo * E, 


lepe<q,6, 
Ik 


Our result follows trivially since &, = & and 
ae 


Suppose f= 56, 


+ eee + 6 Y. 
54, 


+ 8 for 


some 1 € p € q, where 56. > 0 andy. 
Jk “Ik 


1 < k < p. We thus get that © + y; © Ne (8), 


@ + 2y. e Ne(O + Y; )oweay Oo Yj 
ae eee 
p Jp 
é—e Ne(O0 + dO. ¥y. 
(0 4) 
p “p 
-1)° 
p-1 
+6, + 


is a related neighbor of 


e Ne(0 + 


ae 
p-1 p /p 


+6. 7. ¢€ Ne(6 + (6, 

pp 7 : 

Y. +6, Y ror Se Ne (8, “I)y, 

jp-t Jp dp se 

ee ee eee 
Jp Jp 

a QED 


Hence, © 


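We now present,
THEOREM 4: Suppose NI = $\langle \gamma_1, \ldots, \gamma_q \rangle$ for $\gamma_1, \ldots, \gamma_q \in Z^n$.
Then there exist 2 different cells,
$\zeta$ and $\theta$, which are related neighbors of each
other iff $\alpha_{k_1}\gamma_{k_1} + \cdots + \alpha_{k_s}\gamma_{k_s} = 0$ for some
$2 \le s \le q$, where $\alpha_{k_i} > 0$ and $\gamma_{k_i} \ne 0$ for $1 \le i \le s$.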


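PROOF OF THEOREM 4: Suppose 2 different cells, $\zeta$
and $\theta$, are related neighbors of each other. By
Lemma 2, $\zeta = \delta_{j_1}\gamma_{j_1} + \cdots + \delta_{j_p}\gamma_{j_p} + \theta$ and
$\theta = \beta_{t_1}\gamma_{t_1} + \cdots + \beta_{t_u}\gamma_{t_u} + \zeta$ for $1 \le p, u \le q$,
where $\delta_{j_i} > 0$ and $\gamma_{j_i} \ne 0$ for $1 \le i \le p$, and
$\beta_{t_i} > 0$ and $\gamma_{t_i} \ne 0$ for $1 \le i \le u$. Thus,
$\zeta = \delta_{j_1}\gamma_{j_1} + \cdots + \delta_{j_p}\gamma_{j_p} + \beta_{t_1}\gamma_{t_1} + \cdots + \beta_{t_u}\gamma_{t_u} + \zeta$,
and our result follows.

Suppose $\alpha_{k_1}\gamma_{k_1} + \cdots + \alpha_{k_s}\gamma_{k_s} = 0$ for some
$2 \le s \le q$, where $\alpha_{k_i} > 0$ and $\gamma_{k_i} \ne 0$ for
$1 \le i \le s$. Thus, letting $\zeta = \alpha_{k_1}\gamma_{k_1} \ne 0$, we see that
$\zeta = \alpha_{k_1}\gamma_{k_1} + 0$ and
$0 = \alpha_{k_2}\gamma_{k_2} + \cdots + \alpha_{k_s}\gamma_{k_s} + \zeta$. Thus, by Lemma
2, 0 and $\zeta$ are related neighbors of each other.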


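Thus, we have,
COROLLARY 2: Let $\sigma : S^q \to S$ and NI = $\langle \gamma_1, \ldots, \gamma_q \rangle$
for $\gamma_1, \ldots, \gamma_q \in Z^n$. Then, a sufficient condition
for $\{\mathrm{par}(\sigma), \mathrm{seq}_{\tau_1}(\sigma), \text{par-seq}_{\tau_2}(\sigma), \text{seq-par}_{\tau_3}(\sigma)\}$
to be pairwise equal over fixed
configurations of degree $m > 0$, for some trajectories
$\tau_1, \tau_2, \tau_3$, is that,
$\alpha_{k_1}\gamma_{k_1} + \cdots + \alpha_{k_s}\gamma_{k_s} = 0$
for $2 \le s \le q$, where $\gamma_{k_i} \ne 0$ for $1 \le i \le s$,
implies that $\alpha_{k_1} = \cdots = \alpha_{k_s} = 0$.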


We can generalize this in the following 
fashion, 


DEFINITION 11: Let $\sigma$ be a local transformation.
We say that pure parallel processing using $\sigma$ is
weakly equivalent to pure sequential processing
using $\sigma$, over a set of configurations C,
$\mathrm{par}(\sigma) \approx_C \mathrm{seq}(\sigma)$,
if, for every $c \in C$, there exists an appropriate
trajectory $\tau_c$ such that $\mathrm{par}(\sigma)(c) = \mathrm{seq}_{\tau_c}(\sigma)(c)$.
We have similar definitions for the other methods
of processing.

We then have,
COROLLARY 3: Let $\sigma : S^q \to S$ and NI = $\langle \gamma_1, \ldots, \gamma_q \rangle$
for $\gamma_1, \ldots, \gamma_q \in Z^n$. Then, a sufficient condition
for $\{\mathrm{par}(\sigma), \mathrm{seq}(\sigma), \text{par-seq}(\sigma), \text{seq-par}(\sigma)\}$ to
be pairwise weakly equivalent with respect to the
set of finite configurations is that,
$\alpha_{k_1}\gamma_{k_1} + \cdots + \alpha_{k_s}\gamma_{k_s} = 0$
for $2 \le s \le q$, where $\gamma_{k_i} \ne 0$ for $1 \le i \le s$,
implies that $\alpha_{k_1} = \cdots = \alpha_{k_s} = 0$.

PROOF OF COROLLARY 3: This follows directly from
Corollary 2 and Theorem 3. QED


Let us now present an example of a neighborhood
index and local transformation $\sigma$ such that
$\mathrm{par}(\sigma) \ne \mathrm{seq}_{\tau}(\sigma)$ for all trajectories $\tau$. We let
NI = <-1,0,1> (each cell is in Z), S = {0,1} and
$\sigma(1,0,0) = \sigma(0,0,1) = 1$, $\sigma(0,0,0) = \sigma(0,1,0) =
\sigma(0,1,1) = \sigma(1,0,1) = \sigma(1,1,0) = \sigma(1,1,1) = 0$.
Consider the configuration c = 010010; that is,
there is an $i_0 \in Z$ such that $c(i_0) = c(i_0 + 3) = 1$
and $c(i) = 0$ otherwise. It is easily verified
that $\mathrm{par}(\sigma) \ne \mathrm{seq}_{\tau}(\sigma)$ for any trajectory $\tau$. It
is also easily verified that this example does
not meet the sufficient conditions mentioned in
the previous theorems. Here we see clearly what
the concept of related neighbors portends; there
are cells which are related neighbors of each
other and which are such that whichever one is
processed first destroys the possibility that
the other will be in a state so that the resultant
configuration is the same as if the original
configuration were processed in parallel. We call
this the mutually destructive condition. If this
condition exists in a given situation, we have
that $\mathrm{par}(\sigma) \ne \mathrm{seq}_{\tau}(\sigma)$ for any trajectory $\tau$. The
preceding theorems gave sufficient conditions
for this condition not to exist.
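The example can be tried out with a small sketch. It rests on assumptions that are not taken from the paper: the configuration is cut down to a finite window whose outside cells are held at 0, sequential processing is modelled as in-place updates in trajectory order, and only one trajectory (left to right) is tried. It therefore illustrates the mutually destructive effect for that trajectory and does not establish the statement for all trajectories. The function names are hypothetical.

```python
# Local transformation from the example, read as sigma(left, self, right).
# Only (1,0,0) and (0,0,1) map to 1; every other triple maps to 0.
def sigma(left, centre, right):
    return 1 if (left, centre, right) in {(1, 0, 0), (0, 0, 1)} else 0


def neighbourhood(cells, i):
    # Assumption: cells outside the finite window stay 0.
    left = cells[i - 1] if i > 0 else 0
    right = cells[i + 1] if i + 1 < len(cells) else 0
    return left, cells[i], right


def par(cells):
    # Parallel processing: every cell reads the old configuration.
    return [sigma(*neighbourhood(cells, i)) for i in range(len(cells))]


def seq(cells, trajectory):
    # Sequential processing (assumed model): update in place, one cell
    # at a time, in the order given by the trajectory.
    cells = list(cells)
    for i in trajectory:
        cells[i] = sigma(*neighbourhood(cells, i))
    return cells


c = [0, 0, 1, 0, 0, 1, 0, 0]        # 010010 padded with one 0 on each side
par_result = par(c)
seq_result = seq(c, range(len(c)))  # one illustrative trajectory only
print(par_result)
print(seq_result)
```

For this trajectory the two middle cells cannot both reach state 1: once the cell to the left of the pair has been reset, the first of them no longer sees the neighbor it needs.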


We now present some theorems regarding the 
general notion of simulation. But first, we must 
define the following entities, 


DEFINITION 12: A stratified mixed mode tessellation
automaton is called pure parallel if and
only if the union of its sets of sequential,
parallel-sequential, and sequential-parallel
global array transformations is empty.


DEFINITION 13: A stratified mixed mode tessellation
automaton is called pure sequential if and
only if the union of its sets of parallel,
parallel-sequential, and sequential-parallel
global array transformations is empty.


We now have the following, 


THEOREM 5: Let TA = $\langle S, Z^n, \langle \gamma_1, \ldots, \gamma_q \rangle, GT \rangle$. Then,
there is a pure sequential stratified mixed mode
tessellation automaton, STA, which simulates TA
in 2 times real time.


PROOF OF THEOREM 5: 


Case 1: Suppose $\gamma_j \ne 0$ for all $1 \le j \le q$.


Let aie Sa }; tM ccna de 
BT og = {0 onpeees oP pce gt and 
GT. 2"  Prasteed ss reste 0, = fa) .a7 83, 
a,3 and 6, = {b, bot, where 9,9 S= 08,9 S = 6. 


> 


We then let STA = <S xX S$ x (8, x ake 
ZN ,<Yy oe 0e sVqo0? 6TH, where GT* will be defined 
presently. Before we specify what A and TI are, 
let us do the following, 


We first define a map M:GT_v GT_ wu GT 
ened p s p,s 
Uv GT. +{t | t: we Z", t is 1-1 and onto} as 


follows (this will be the set of trajectories 
used in STA): 


a) Forp ¢ cT let M(p) be arbitrary 

b) For pe GT., p 
trajectory t*. Let M(p) 

c) For p € GT 


p,s 
. Let M(p) be any trajectory such 


is determined by some 


Tx 


» P is determined by some 


trajectory t** 
that, 


i) Suppose 5&5 are two different 
cells of Z" in the range of t** and that 


rex | (a) < ce! (z5), Then, 
(M(o))""() < (H(0))7! @). 


(The order of sequential processing is preserved.) 


ii) Suppose Z € range(t**), 


& @ range(t**) and —& e€ Ne(t) or z& ce Ne(&). Then, 


(M(o))7"(z) < (M(o)) 7 1(z). 


(A cell processed sequentially in TA may be processed
any time in STA as long as all cells in its
neighborhood or which have it as a neighbor, which
were processed in parallel in TA, are processed
first.)

d) For p e« GT 

Sp 

trajectory t***. Let M(p) be any trajectory such 
that, 


, p is determined by some 


i) Suppose ~1,Z, are two different 
cells of Z" in the range of r*** and that 


“l(y) < TRxK "(a Then, 
(M(o))7' (z,) < (m(0)) (zo). 


ii) Suppose z #¢ range(t***) , 
— e range(t***) and — ¢ Ne(z) or & € Ne(&). Then, 


(M(o))"(e) < (M(o)) 7*(z). 


(A cell processed in parallel in TA may be processed
any time in STA as long as all cells in its
neighborhood or which have it as a neighbor, which
were processed sequentially in TA, are processed
first.)


leslesie 


For l<j< 


< restttu, let 1) = M(p 5). 
> CONT). 


= c*, where, for 


We now specify the map A:CON, 
Let c ¢ CON,,. Then, A(c) 


te z”, if c(z) = s, then c*(z) = <s,8,<<a, 
| 1 
exes »b. >>>, where, 


Peestttu | res+tu 
a) $ is a fixed element of $ 


b) For 1 < j < r, we have i = 1, for 
r+] < j < rts, we have = 2, for rtst]l < j < 


rtst+t, we have i, = 3, and for rtstttl < j < 
rtst+ttu, we have i = 4. (This indicates the 


total range of processing: parallel, sequential, 
parallel-sequential and sequential-parallel.) 


c) For 1 < j < rtstttu, b, indicates 
whether or not cell ¢ is in the range of the 
trajectory corresponding to Die That is, for 


I< j < r, we let I, = 1, while for r+l < j < 
rtstttu, Fr = 1 if and only if cell ¢ is not 


in the trajectory which determines an 


As a notational convenience, for 


ao = »b >> € 


a »bd. 7 yee0 5d. i 
restttu 


| | St tbu 
, let a[k] = <a 


, \Pestt-tu 
(0, x 85) 


» obs->y for 
IE I 

1 < k< rtstttu, and a[k][1] =a 
b. e 

k 


; and a[k] [2] = 
k 


Now, the map [':GT > GT#2 


T(o) = 


is defined by 


<py 20> » where we now BenINe es ences 


Suppose p = Pas for 1 < i € rtstttu. Now, p 
is defined via some local transformation o33 We 
have piat7ee is defined via the trajectory tS 
M(o) and local transformation o% , where, 

] 


o* (<s 8 9% 7 pece <5 8 9 >) = 
Py ae | qt]? gtl? gtl 


coy 
A 
Wn 


qt! 20, (Sisee Sor) Od? if either 


l< i<r, or rtstl < i < rtstt and 
Moe) LET [2] = by» or rtsttt] < i < 


rtsttt+u and O41 ft f2] = by 


<o , if either 


o Syreeers 


qt) Soe] rt ie 
r+l < i < rts, or rtstttl < i < 
rtstttu and aoa LENTZ = by 

a > if 


46° ($7, 304g5".4) 76 24 oe 
] qtl’ ?"qtl 


p qt 


rtst] < i < rtstt and O41 Lil [2]=b,, 


where, for 1 < k < qtl, 


il 
ao 


: ts, if a Li} [2] 


Sk 


Foot oat We Uh aeons Wate anes Ween Woe Sao aoe ates tt aon Tee tt ane one Wane Wane aan tee ace 
il 
(oom 


{ 
{3 if a, [i] [2] 


We also have that p5 is defined via local 


transformation o* 
where, P92 


and an arbitrary trajectory, 


o* (<s,,S ,0y> <s S41, .9>) = 
Po Pry ree Mgt LP gt] gt 


{ A ‘ ‘ = 
(<Sgat Sgt qe? i f O gay CHT [2] = by 


A ° ° = 
SS ge Sat] 90 ty? if Oe Lil [2] = by 
Case 2: $\gamma_j = 0$ for some $1 \le j \le q$.


This case is similar to the above.


It is straightforward to show that STA
simulates TA in 2 times real time.
QED 


Before we present our last theorem, we 
define the following concept, 
DEFINITION 14: Let TA = $\langle S, Z^n, \langle \gamma_1, \ldots, \gamma_q \rangle, GT \rangle$.
Suppose $\zeta \in Z^n$ and $\rho \in GT_s \cup GT_{p,s} \cup GT_{s,p}$. We
now define what we call the type of cell $\zeta$ with
respect to $\rho$, denoted by $\mathrm{type}_{\zeta,\rho}$,
as follows,
a) Suppose p ¢€ GT_. Let T, be the trajectory 
which determines p. Set So = {*}, and, for i 3 0, 
7 q ol 
let Si,, = 5,4 Si. For Os jet, (c), we 


define ef) 2 > Si (Thus, © (2) is the constant 


function which maps z" into {*}.) ForO<k<« 


wo! (g)=1 and — ¢ Z", we define, 


(c) (c) ne 
{<e =" (E+y,) ,.26,6 > (E+y_)> if 
Mie "Tq 
keT ‘2 { g= t, (k) 
{ 
fe (2) (c) otherwise 
TA (c) 
Then, type = ¢ = (z) 
cP t'() 
b) Suppose p e€ GT, 5° Let a be the trajec- 


9 
tory which determines o. Set To = {*,<%,...,%>}, 
q 


q 


and, for i 2 0, let Te] = T. v Ty. 


Suppose that ¢ # range(t.). Then, 
type = <* 


Suppose that Z e range (t,). Let 


\ @ 


: | : 
= For 0 V( def 
; pl range(c,) or Os jeu, (¢), we define 
$f_j^{(\zeta)} : Z^n \to T_j$, where

$f_0^{(\zeta)}(\xi) = *$ if $\xi \in \mathrm{range}(\tau_2)$, and $\langle *, \ldots, * \rangle$ otherwise,

$f_{k+1}^{(\zeta)}(\xi) = \langle f_k^{(\zeta)}(\xi + \gamma_1), \ldots, f_k^{(\zeta)}(\xi + \gamma_q) \rangle$ if $\xi = \tau_2(k)$, and $f_k^{(\zeta)}(\xi)$ otherwise.

Then, $\mathrm{type}_{\zeta,\rho} = f_{u_2(\zeta)}^{(\zeta)}(\zeta)$.

We then have the following,
THEOREM 6: Let TA = $\langle S, Z^n, \langle \gamma_1, \ldots, \gamma_q \rangle, GT \rangle$. For
each $\rho \in GT_s \cup GT_{p,s} \cup GT_{s,p}$, suppose that
$|\{\mathrm{type}_{\zeta,\rho} \mid \zeta \in Z^n\}| < \infty$.
Then, there is a pure parallel stratified mixed
mode tessellation automaton which simulates TA
in real time.
PROOF OF THEOREM 6: This is similar to the proof
of Theorem 1 in Grosky and Tsui [3].
QED
PARALLEL IMPLEMENTATION OF A TWO-DIMENSIONAL MODEL


Valere J. Kransky, E. Dick Giroux, and Gary A. Long 
Lawrence Livermore Laboratory, University of California 
Livermore, California 94550 


Abstract—A large, serially programmed, 
two-dimensional mathematical model has been 
reprogrammed for the CDC STAR-100 and the 
CDC 7600 computers using parallel programming 
techniques. The parallel program is currently 
running on the CDC 7600. The concepts, techniques,
and the results of its use are discussed.
The parallel program executes efficiently, can
be modified easily, and requires no major redesign
or reprogramming for conversion to other
large-scale parallel machines.


Introduction 


The Lawrence Livermore Laboratory
began seriously investigating the programming of
"parallel" machines in 1969. Our group was
assigned the task of reprogramming a large,
two-dimensional physical simulation model
called HEMP [1]. The equations are Lagrangian
and the difference scheme is explicit. Included
in the model are hydrodynamics, elastic-plastic
flow, multiple sliding, multiple materials, and
fracturing. We established the following programmatic
goals:

(1) To formulate parallel programming
techniques and methods for general use.

(2) To develop a program that would 
execute with the same source deck on different 
types of computers. (This is particularly im- 
portant at LLL because of our history of ac- 
quiring new types of large-scale computers.) 

(3) To achieve optimum execution rates on 
parallel computers. 

(4) To design the program in a manner that 
would provide maximum flexibility for frequent 
modifications. 


Vector Programming 


After analyzing several different large- 
scale parallel computers (See Appendix A), we 
decided that vector programming techniques 
would satisfy our needs. We define a vector to 
be a contiguous array of data whose boundaries 
are specified by a descriptor word. The data 
contained in a vector may be: 


(1) floating point 
(2) integer 

(3) bits 

(4) bytes 

(5) characters 


A descriptor is a pointer whose low-order bits 
are a bit-base address that points to the data and 
whose high-order bits contain the item count of 
the data set. 
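The descriptor layout described above can be sketched as simple bit packing. The field widths used here (a 48-bit bit-base address and a 16-bit item count in one 64-bit word) are an assumption made for illustration; the text only states which end of the word each field occupies. The function names are hypothetical.

```python
ADDRESS_BITS = 48   # assumed width of the low-order bit-base address field
LENGTH_BITS = 16    # assumed width of the high-order item-count field
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1
LENGTH_MASK = (1 << LENGTH_BITS) - 1


def make_descriptor(bit_address, item_count):
    """Pack an item count (high-order bits) and a bit address (low-order bits)."""
    if not 0 <= bit_address <= ADDRESS_MASK:
        raise ValueError("bit address does not fit the address field")
    if not 0 <= item_count <= LENGTH_MASK:
        raise ValueError("item count does not fit the length field")
    return (item_count << ADDRESS_BITS) | bit_address


def descriptor_address(descriptor):
    return descriptor & ADDRESS_MASK


def descriptor_length(descriptor):
    return (descriptor >> ADDRESS_BITS) & LENGTH_MASK


d = make_descriptor(0x1000, 1500)
print(hex(descriptor_address(d)), descriptor_length(d))
```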

The ease with which one can manipulate
data is the essential feature of vector programming.
We can manipulate vectors with such
operations as:

(1) Compress - selects a subset of a vector
under the control of a bit vector.

(2) Merge - puts together two vectors
under the control of a bit vector.

(3) Compare - generates bits in a bit vector
as a result of comparing two vectors.

(4) Transmit index list - collects into a
contiguous result vector discontiguous elements
from another vector by using an index vector.

(5) Transmit index destination - stores into
discontiguous locations the contiguous elements
of another vector by using an index vector.

Such instructions as these permit the "massaging"
of data for the various equations found in
large-scale scientific programs.
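The five operations listed above can be sketched with plain Python lists. This is an illustration of the semantics as described, not the machine instructions themselves. In particular, the merge rule (take the next element of the first vector where the bit is 1, otherwise the next element of the second) and the choice of "greater than or equal" for the compare are assumptions.

```python
def compress(vector, bits):
    # Keep the elements whose control bit is 1.
    return [v for v, b in zip(vector, bits) if b]


def merge(a, b, bits):
    # Assumed rule: a 1 bit takes the next element of a, a 0 bit the next of b.
    ia, ib = iter(a), iter(b)
    return [next(ia) if bit else next(ib) for bit in bits]


def compare_ge(a, b):
    # Produce a bit vector from an elementwise comparison.
    return [1 if x >= y else 0 for x, y in zip(a, b)]


def transmit_index_list(source, index):
    # Gather: collect scattered elements into a contiguous result.
    return [source[i] for i in index]


def transmit_index_destination(dest, values, index):
    # Scatter: store contiguous values into scattered locations.
    out = list(dest)
    for i, v in zip(index, values):
        out[i] = v
    return out


bits = [1, 0, 1, 0]
print(compress([10, 20, 30, 40], bits))
print(merge([10, 30], [20, 40], bits))
```

Compress followed by merge with the same bit vector restores the original ordering, which is what makes the compress-calculate-expand pattern used later in the paper possible.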


Vectorization of the HEMP Equations 


The HEMP problem-solving procedure
consists of repeated solutions of explicit equations
over a large, two-dimensional grid. Each
complete pass through the equations for all grid
points (nodes) and zones is a "problem cycle."


Nongeneral Calculations 


Certain parts of the two-dimensional mesh
(see Fig. 1) must be treated in special, nongeneral
ways in the solution of practical problems.
Three of the more important of these are
described below. (The problem shown in Fig. 1
is not a typical HEMP problem; most problems
are more complex and much larger.)

(1) Most of the physical systems calculated
by the HEMP program include more than one
type of material. The materials are in contact
with each other at interior boundaries. Often,
large displacements along these surfaces take
place as the system is solved on the computer.
In the program this necessitates the inclusion
of special "slide-line" calculations and logic to
simulate the surfaces with a decoupled grid.

(2) There are usually two or more materials
in a HEMP problem. The behavior of
these materials is modeled by equations-of-state.
The program must associate the proper
equation-of-state with the appropriate grid zone,
and calculate material behavior.


(3) Various boundary conditions are
associated with the exterior boundaries of the
system. These require that the program do
selective calculations for certain boundary
points.

Work performed under the auspices of the U.S. Atomic Energy Commission.


HEMP Difference Scheme 


The HEMP equations-of-motion require the
calculation of a line integral at each node. This
is represented by the dashed line shown connecting
nodes I, II, III, and IV in Fig. 2. In
addition, zonal data must be accessed at zones
(1), (2), (3), and (4). The exterior boundary-line
integrals are calculated in a manner similar
to that of the four-zone case, except that coordinates
of the node being accelerated are also
assigned to one of the surrounding nodes (see
Fig. 3).

It was determined that the movement of the
boundary points, while subject to various nongeneral
conditions, could be substantially calculated
with the same equations (and therefore in
the same vector) as the interior grid points. For
the purpose of describing the vector techniques
used in doing some of the calculations, a tiny
grid with a slide-line is shown in Fig. 4. In
order that we may treat all nodes with the same
equations to obtain a "tentative" acceleration, we
expand the nodal vectors with a "geometric bit
string." By geometric bit string, we mean a bit
string whose bit pattern is dictated by the grid's
shape and size. This expansion creates a vector
that has vacant elements for the insertion of
"phony node" values. The expanded grid is
shown in Fig. 5. Through the use of compression,
expansion, and controlled-store operations,
the phony nodes are assigned the values of the
adjoining real boundary nodes. The zonal
quantities are expanded out in a similar manner.
Now we have a grid that includes phony nodes and
phony zones.
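The expansion step can be sketched as follows, assuming a one-dimensional row of nodes for brevity. The bit pattern, the `source_of` map from each phony slot to its adjoining real node, and the function names are all hypothetical; in the program the pattern would be derived from the grid's shape and size.

```python
def expand(vector, bits, fill=0.0):
    # Spread the real values over the positions whose bit is 1,
    # leaving vacant (fill) elements where the bit is 0.
    it = iter(vector)
    return [next(it) if b else fill for b in bits]


def fill_phony(expanded, bits, source_of):
    # Controlled store: each vacant slot copies the value of the
    # adjoining real boundary node named in source_of.
    out = list(expanded)
    for i, b in enumerate(bits):
        if not b:
            out[i] = expanded[source_of[i]]
    return out


real_nodes = [1.0, 2.0, 3.0]
geometric_bits = [0, 1, 1, 1, 0]     # phony slot at each end of the row
source_of = {0: 1, 4: 3}             # phony slot -> adjoining real node
expanded = expand(real_nodes, geometric_bits)
padded = fill_phony(expanded, geometric_bits, source_of)
print(padded)
```

After this step every node, real or phony, has a full set of neighbors, so a single set of equations can be applied to the whole vector.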

Compression with appropriate geometric
bit strings is done to isolate the diagonal end
points. The diagonal differences (which are
zonal-centered quantities) are calculated. These
diagonal differences are compressed with
another set of geometric bit strings to produce
nodal-centered values. These are used to calculate
the acceleration terms. New velocities
are calculated that are used to reposition the
nodes.

Fig. 1. A HEMP problem.

The acceleration terms are needed for the
boundary calculations (including slide-lines);
therefore, it is efficient to calculate acceleration
terms in one vector pass. For boundary points,
the position is only tentative and may be overridden
by subsequent calculations.

Fig. 2. HEMP acceleration arms.

Fig. 3. HEMP boundary arms.


The Slide-Line Calculation Logic 


Slide-line calculations are complex. They
require that nodes and zones on each side of the
slide-line be associated with nearby nodes and
zones on the opposite side of the slide-line.
Figure 6 shows how zones must be mapped
across a slide-line. This relationship can
change from problem cycle to problem cycle. A
search procedure is required to determine this
relationship. This was at first thought to be an
inherently serial process, and therefore not
amenable to vector programming procedures.
We vectorized this procedure so that it is done
in a few iterations, through the use of cascading
compare and compress operations.

Fig. 4. A simple HEMP grid.

Fig. 6. Slide-line mapping.

An "ordering index" vector is calculated
and saved from cycle to cycle. This vector
describes the relative nodal positions at that
cycle. During each cycle, the ordering index
vectors are updated to reflect positional changes.
To update the ordering index numbers, all nodes
on one side of the line are checked against their
previously known solution points on the other
side to determine if those solutions are currently
correct. The currently correct nodes are compressed
out of the vector. A trial ordering
adjustment is made with the reduced vector. If
found to be satisfactory, these solutions are
compressed out. This iterative procedure is
continued until all solutions are found. The
relative positions of the slide-line nodes change
little from cycle to cycle. Ordinarily, one to
three iterations are required to update all the
ordering index numbers. This process quickly
cascades from full-length slide-line vectors to
much shorter vectors. Although they are more
involved, subsequent slide-line calculations that
use these ordering numbers cascade in a similar
manner.
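The cascading update can be sketched under a strong simplification that is not from the paper: the slide-line is reduced to one dimension, so that the "solution" for a node at coordinate `x[i]` is the index `j` of the interval `y[j] <= x[i] < y[j+1]` on the opposite side. The function and variable names are hypothetical.

```python
def update_ordering(x, y, previous_index):
    """Update ordering indices by repeated compare and compress.

    x: node coordinates on one side (each assumed to lie in [y[0], y[-1])).
    y: ascending coordinates on the opposite side.
    previous_index: last cycle's solution for each node.
    """
    index = list(previous_index)
    active = list(range(len(x)))       # sequential index set
    iterations = 0
    while active:
        iterations += 1
        # Vector compare: which of the remaining guesses are correct?
        ok = [y[index[i]] <= x[i] < y[index[i] + 1] for i in active]
        # Compress the correct nodes out of the vector.
        active = [i for i, good in zip(active, ok) if not good]
        # Trial ordering adjustment on the reduced vector.
        for i in active:
            index[i] += 1 if x[i] >= y[index[i] + 1] else -1
    return index, iterations


y = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [0.5, 1.2, 3.1]                    # nodes have moved a little
new_index, n_iter = update_ordering(x, y, [0, 0, 2])
print(new_index, n_iter)
```

Because nodes move little between cycles, most guesses pass the first compare, and the vector that reaches the adjustment step is much shorter than the full slide-line.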

Slide-line manipulation includes the building
and use of dynamic bit strings. These conditional
bit strings are used to compress a
sequential index set that is used to fetch or store
elements of data within the slide-line vectors.
Slide-lines are relatively short, and we may have
several slide-lines in a problem. Therefore,
they are catenated together so that all slide-lines
can be calculated in one vector pass. Since
the acceleration equations are the same for both
sides of the slide-line, alternate sides are catenated
together so that all common parts of the
calculation can be done in one vector pass.


Equation-of-State Handling 


Each problem can have associated with it a
number of equations-of-state. In practice, the
same equation-of-state is associated with many
contiguous zones. This enabled us to:

(1) select zones with like material properties,

(2) arrange the zonal variables into
material-related vectors, and

(3) calculate similar zones in one series of
vector operations.

A particular zonal grid vector is composed of
packed integer fields. One field is a group of
numbers that are associated with a particular
equation-of-state form. Another field is a
material number within that form. The material
number within the form is used as an index to
access equations-of-state coefficients within that
form. When the material properties are to be
calculated within a problem cycle, this vector is
unpacked (using vector operators) into a number
of full-word vectors. A vector compare is done
to determine which zones are associated with
each equation-of-state form. The appropriate
variables are then compressed out using the resulting
bit string. The corresponding material-within-the-form
numbers are also compressed out.
The form number is used to control a branch to
the appropriate equation-of-state coding. The
material-within-the-form number is used as an
index to select the appropriate equation-of-state
coefficients for the material. The program
makes repeated passes through this procedure
until all zonal material properties are calculated.
Because the program is provided with a list
of forms for a given problem, only as many
passes are made as there are forms in the
problem.
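The procedure above can be sketched as one pass per equation-of-state form. Everything specific here is an assumption made for illustration: the packing (form number above an 8-bit material number), the two equation-of-state forms, and the coefficient tables are hypothetical and are not HEMP's.

```python
FORM_SHIFT = 8          # assumed: form number stored above the low byte
MATERIAL_MASK = 0xFF    # assumed: material number stored in the low byte


def pack(form, material):
    return (form << FORM_SHIFT) | material


# Hypothetical equation-of-state forms (pressure from density).
def eos_linear(density, coef):
    return [c[0] * r for r, c in zip(density, coef)]


def eos_quadratic(density, coef):
    return [c[0] * r + c[1] * r * r for r, c in zip(density, coef)]


EOS_FORMS = {1: eos_linear, 2: eos_quadratic}
COEFFICIENTS = {1: {0: (2.0,), 1: (3.0,)}, 2: {0: (1.0, 0.5)}}


def zone_pressures(packed, density, forms_in_problem):
    # Unpack the packed integer fields into full-word vectors.
    form = [p >> FORM_SHIFT for p in packed]
    material = [p & MATERIAL_MASK for p in packed]
    pressure = [0.0] * len(packed)
    for f in forms_in_problem:                      # one pass per form
        bits = [1 if x == f else 0 for x in form]   # vector compare
        index = [i for i, b in enumerate(bits) if b]
        rho = [density[i] for i in index]           # compress variables
        coef = [COEFFICIENTS[f][material[i]] for i in index]
        result = EOS_FORMS[f](rho, coef)            # branch on form number
        for i, value in zip(index, result):         # store back by index
            pressure[i] = value
    return pressure


packed = [pack(1, 0), pack(1, 1), pack(2, 0), pack(1, 0)]
print(zone_pressures(packed, [1.0, 2.0, 2.0, 3.0], [1, 2]))
```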


Forking 


The equation-of-state calculations are examples
of program fork handling. (By forks we
mean the selection of various calculational
operations. In a serial program this would be
done by conditional branching.) Many forks in
the program are done dynamically (dynamic
forking). In general, a particular vector compare
produces a different bit string each problem
cycle. The bit string is used to control the
calculations. We use two methods of control
logic. One method is the previously mentioned
vector compare-compress-calculate-expand-and-store
series of operations. There is overhead
in doing the compressions and expansions
in this method. A second method is to use full-length
(uncompressed) vectors through both
sides of the fork, and then use a bit string(s) to
control the storing of results. Here we are
calculating many results that are going to be
unused, and therefore wasted. Whether to use
the compress-expand method or the controlled-store
method depends on the bit density of the
fork bit string and the amount of calculation on
each side of the fork.

When the bit string is relatively sparse on
the long side of the fork, it may be more efficient
to compress, calculate, expand, and store.
When the bit string is relatively dense on the
long side of the fork, it may be more efficient
to calculate the entire vector both ways and use
the controlled store. The method to use is
determined through the use of an equation that
has in it the vector lengths, the operation types,
and the number of operations on each side of
the fork [2]. The decision is made dynamically
each problem cycle. (This calculation is
practical because our vectors are long, and
some forks require many operations on one or
both sides of the fork.)
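The two control methods can be sketched side by side. The cost model in `prefer_compress` is a made-up stand-in for the equation cited as [2], which is not reproduced in the text. It only captures the stated trade-off between compress/expand overhead and wasted full-length calculation, and its parameters are hypothetical.

```python
def fork_compress_expand(x, bits, long_side, short_side):
    # Compress each side's operands, calculate, then expand and store.
    long_in = [v for v, b in zip(x, bits) if b]
    short_in = [v for v, b in zip(x, bits) if not b]
    long_out = iter([long_side(v) for v in long_in])
    short_out = iter([short_side(v) for v in short_in])
    return [next(long_out) if b else next(short_out) for b in bits]


def fork_controlled_store(x, bits, long_side, short_side):
    # Calculate both sides at full length; the bit string picks the result.
    long_all = [long_side(v) for v in x]
    short_all = [short_side(v) for v in x]
    return [a if b else s for a, s, b in zip(long_all, short_all, bits)]


def prefer_compress(n, n_long, ops_long, ops_short, overhead_per_element):
    # Hypothetical cost model: element operations plus a per-element
    # charge for the compressions and expansions.
    compress_cost = (overhead_per_element * n
                     + ops_long * n_long
                     + ops_short * (n - n_long))
    store_cost = (ops_long + ops_short) * n
    return compress_cost < store_cost


x = [1, 2, 3, 4]
bits = [1, 0, 0, 1]
a = fork_compress_expand(x, bits, lambda v: v * v + 1, lambda v: -v)
b = fork_controlled_store(x, bits, lambda v: v * v + 1, lambda v: -v)
print(a, b)
print(prefer_compress(1000, 100, 20, 2, 4), prefer_compress(1000, 950, 20, 2, 4))
```

Both methods give the same results; only the amount of work differs. With these illustrative numbers, the sparse case (100 of 1000 elements on the long side) favors compression and the dense case (950 of 1000) favors the controlled store.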


Operation Skipping 


The issuance of one vector instruction
produces a large number of results. This has
introduced another time-saving flow-control
technique that is not available in serial programming:
operation skipping [2]. Some of the
HEMP equations contain terms that are not used
in a particular problem. In serial programming,
it is more expensive to check a flag and possibly
skip an operation each time through a loop than
it is to issue the unnecessary instruction(s). In
vector programming, a single flag check can
cause a sequence of vector instructions to be
skipped, saving hundreds or thousands of unrequired
operations. This test can also be done on
the length field of a vector descriptor.
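A minimal sketch of operation skipping follows, with hypothetical names and a placeholder calculation. It counts element operations to show that one flag (or length) check removes a whole vector operation rather than one scalar operation per loop iteration.

```python
def apply_terms(base, optional_term, term_enabled):
    """Return (result, element_operations) for a two-step vector update."""
    element_ops = 0
    result = [2.0 * b for b in base]          # always-needed vector operation
    element_ops += len(base)
    # One check guards the whole optional vector sequence; the length
    # test plays the role of testing a descriptor's length field.
    if term_enabled and len(optional_term) > 0:
        result = [r + t for r, t in zip(result, optional_term)]
        element_ops += len(result)
    return result, element_ops


print(apply_terms([1.0, 2.0], [0.5, 0.5], False))
print(apply_terms([1.0, 2.0], [0.5, 0.5], True))
```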


Character Vector Techniques 


The HEMP program produces large
quantities of printer output. To make this an
efficient process, we have used character vector
operations to convert binary data to BCD
(binary coded decimal). This is one application
of vector techniques to areas other than
arithmetic number crunching.(a)


Tree Structures 


Some index sets, bit strings, and other
data sets are constructed at generation time;
others are built dynamically during execution.
Because of the wide divergence of HEMP problem
sizes, shapes, and options, the use of fixed
blocks of memory to store this data would be a
waste of storage space. To conserve core, we
pack this data in memory. We access this data
through a series of linked descriptors or "tree
structures" that point to the data. The top
descriptor points to a vector of descriptors, each
of which points to another vector of descriptors
or data. Each descriptor tree eventually points
to data. If an unusually large data set is required,
it takes the needed space for that problem
only. If a data set is not required for a
particular problem run, it consumes no memory.
Tree structures are used in the slide-line, the
boundary condition, and other sections of the
program. A simple example is shown in Fig. 7.
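A descriptor tree of this kind can be sketched with a small class. The class, the `resolve` helper and the sample contents are hypothetical; the point is only that each descriptor leads either to more descriptors or to data, and that an unused data set is an empty descriptor that occupies no data storage.

```python
class Descriptor:
    """Points to a vector of descriptors, a vector of data, or nothing."""

    def __init__(self, target=None):
        self.target = target

    def is_empty(self):
        return self.target is None


def resolve(root, path):
    # Follow one index per level from the top descriptor down.
    node = root
    for i in path:
        node = node.target[i]
    return node.target


slide_line_indices = Descriptor([10, 11, 12])   # a data vector
boundary_values = Descriptor([7.5])             # another data vector
not_needed = Descriptor()                       # data set absent in this run
root = Descriptor([
    Descriptor([slide_line_indices, not_needed]),
    boundary_values,
])

print(resolve(root, [0, 0]))
print(resolve(root, [1]))
print(resolve(root, [0, 1]))
```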


Core Allocation 


Allocation of storage for all vectors needed
by the HEMP program is done dynamically, at
execution time [4]. The program never allocates
more vector space than it needs and/or is
physically available in core. The allocation of
core is based on the contents of a HEMP data file
(Appendix C).


Temporary Results Vectors 


The evaluation of a typical vector 
arithmetic expression requires temporary 


(a) Character vector operations facilitate the
writing of interactive, timeshared, and interpretive
routines. Character vector techniques
can be applied to compilers, loaders, and other
system software packages [3].


(b) For some problems the entire grid cannot
be held in memory. The program explicitly
handles the transfer of data between core
memory and disk. Even though a computer
(like the CDC STAR-100) may have virtual
memory, the overhead associated with page
faulting is too costly.
Fig. 7. A tree structure. (A fixed-location Level 0 descriptor points to variable-location Level I descriptors, which point to Level II descriptors or data, which in turn point to Level III data.)


vectors of the same length as the result vector.
Result vectors in a typical HEMP problem are
about 1500 words long. As in
serial programming, certain calculational results
must be saved for later use. In serial
programs, this does not present a memory
management problem, since each saved result
only needs one word of core. In vector
programming, this is a serious problem. Each
saved result is a vector that requires a large
amount of core memory. To alleviate this
problem, we reuse the same dynamic vector
space as much as possible. This is done
through the use of a simple "saved vector"
allocation scheme.

The base addresses of saved vectors are 
kept in a stack. The base addresses of "saved 
bit vectors" are kept in another stack. Initially, 
the base addresses in a stack are in ascending 
order. The number of words between any two 
adjacent entries in a stack is the same (the 
length of the longest result vector needed). 
When a calculation needs a result vector, it 
takes the next entry (address) from a stack. 
When a result vector is no longer needed, the 
address is returned to the stack. 
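
The saved-vector scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, the method names, and the example base address of 10000 are hypothetical, and the real program keeps core addresses in LRLTRAN rather than Python lists. 

```python
class SavedVectorStack:
    """Stack of equally spaced base addresses for saved result vectors."""

    def __init__(self, first_base, slot_words, n_slots):
        # Initially the base addresses are in ascending order, slot_words apart
        # (slot_words stands for the length of the longest result vector).
        bases = [first_base + i * slot_words for i in range(n_slots)]
        # Reverse so that the lowest address is handed out first by pop().
        self._free = bases[::-1]
        self.slot_words = slot_words

    def take(self):
        """Take the next entry (address) from the stack."""
        if not self._free:
            raise MemoryError("no saved-vector space left")
        return self._free.pop()

    def give_back(self, base):
        """Return an address once its result vector is no longer needed."""
        self._free.append(base)


# Hypothetical example: three slots of 1500 words starting at address 10000.
stack = SavedVectorStack(first_base=10000, slot_words=1500, n_slots=3)
a = stack.take()
b = stack.take()
stack.give_back(a)
c = stack.take()  # reuses the space that was just released
```

Because a returned address goes back on top of the stack, the most recently released space is the first to be reused, which keeps the working set of dynamic vector space small. 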
Multiple Vector Passes 


When discussing the equations, it was 
assumed that all vectors were full grid size. 
This was a simplistic view, taken to make the 
discussion easier to understand. In actual 
practice, a problem must be calculated by 
making multiple passes through the equations. 
This is necessary because current computers 
do not have enough core memory available for 
the save-vectors to be the length of a full grid 
vector. The number of passes through the 
equations is a function of the maximum size of 
a saved result vector and the size of a grid 
vector. 
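
As a rough illustration of how the pass count follows from the two sizes, the sketch below splits a grid vector into consecutive pieces no longer than a saved result vector. The text does not give the actual rule, so simple ceiling division without overlap between passes is an assumption, and the function name and the sizes used are hypothetical. 

```python
def plan_passes(grid_words, save_words):
    """Return (start, length) pairs, one per pass through the equations.

    Assumes the grid vector is cut into consecutive, non-overlapping
    pieces, each no longer than one saved result vector.
    """
    return [(start, min(save_words, grid_words - start))
            for start in range(0, grid_words, save_words)]


# Hypothetical sizes: a 4000-word grid vector and 1500-word save-vectors.
passes = plan_passes(grid_words=4000, save_words=1500)
number_of_passes = len(passes)
```

Each (start, length) pair corresponds to one adjustment of the grid variable descriptors before a pass. 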

Prior to each pass through the equations, 
the grid variable descriptors are adjusted to 
point to that part of the grid that is to be cal- 
culated. If a slide-line(s) is included in a pass, 
the necessary data vectors and vectors of 
descriptors for the slide-line equations are con- 
structed. The coordinate vectors and the 
velocity vectors are merged with phony nodes. 
The geometric bit vectors are also dynamically 
constructed each pass.(a) 


Vector Programming Aids 


The implementation of our vector program 
has been facilitated by the use of: 

(1) an APL interpreter (Appendix D), 

(2) programming language extensions 
(Appendix E), and 

(3) new debugging routines (Appendix F). 


Current Status 


The vector HEMP program is currently 
running on the CDC 7600 through the use of 
vector software kernels (Appendix G). Vector 
HEMP demonstrates marked improvement in 
execution rate over the serial FORTRAN pro- 
gram (Appendix H). The same vector HEMP 
source deck that is in use on the CDC 7600 will 
be used on the CDC STAR-100 computer 
(Appendix I). 


Summary and Conclusion 


The following vector techniques were 
developed and used: 

(1) geometric bit strings 
(2) phony nodal and zonal elements 
(3) dynamic bit strings 
(4) static forking 
(5) dynamic forking 
(6) operation skipping 
(7) cascading vector solutions 
(8) character vectors for printing results 
(9) descriptor tree structures 


The vectorization of large scientific computer 
programs is accomplished by complete redesign 
and reprogramming. In general, improvements 
in execution rates will not be achieved by 
simply "vectorizing" a few subroutines. Vector 
programming techniques can be successfully 
applied to a wide variety of large-scale parallel 
computers. 


(a) We have a full set of Boolean bit vector 
operations to facilitate the construction of the 
geometric bit strings [5]. 


Appendix A. Parallel Computer Analysis 


The pursuit of our goals necessitated a 
detailed analysis of parallel computers. By 
parallel we mean any computer on which a 
single operator at the source level will cause 
multiple, identical machine operations to occur. 
The operators may invoke a single hardware 
instruction or a sequence of instructions. 

The parallel computers studied included 
multiprocessor computers, array computers, 
pipeline computers, and associative computers. 
Our attention was focussed mainly on large- 
scale computers in existence or in the planning 
stage. The computers we investigated were: 

(1) the CDC STAR-100 [6] 

(2) the Burroughs ILLIAC IV [7] 

(3) the Texas Instruments ASC [8], [9] 

(4) the CDC 7600 [10] 

The STAR-100 and ASC computers use "pipes" 
through which operands from contiguous memory 
locations are streamed. The ILLIAC IV uses 
64 separate processing elements (PE's) that can 
all execute the same instruction simultaneously. 
The STAR-100, ASC, and ILLIAC IV all use "bit 
logic" to control the storing of operands to 
memory. For bit logic operations, the STAR- 
100 has a much more complete set of instruc- 
tions than either the ASC or the ILLIAC IV. In 
parallel computation, bit logic replaces the 
indexes used in serial programming and is the 
most important nonarithmetic capability of the 
computers. 

The CDC 7600, while not a pipeline or 
multiprocessor computer, can be an efficient 
vector machine through the implementation of 
software kernels (Appendix G). 


Appendix B. Vector I/O Library 


We used character vector techniques exten- 
sively in writing a vector I/O library. All I/O is 
done by subroutine calls to this library. 

The HEMP source deck contains no READ, 
WRITE, etc. type of statement. Only a small 
part of this library (that part which interfaces 
directly with the operating system) has been 
written in a machine-dependent manner. One 
subroutine in this library is used for printing 
vectors of numbers. This vector "write" routine 
can execute up to six times faster than serial 
FORTRAN "write" routines on the CDC 7600. 


(b) Because of the extreme disparity in the 
calculational speed between a truly vectorized 
algorithm and a calculation done in a loop on a 
parallel machine, it does little good to vectorize 
just part of a program and leave the rest in 
serial mode. If parallel machines are to per- 
form at anywhere near their capability, all 
array-type calculations must be vectorized. If 
arrays are calculated serially, the performance 
of parallel computers will be degraded by 
factors of 10 to 30. 


Appendix C. HEMP Data File 


A HEMP data file is composed of three 
parts (Fig. 8): 

Part I contains various scalar information 
about the problem and the size (number of words) 
of Part II (this size changes from problem to 
problem). 

Part II contains descriptor tree structures 
and data vectors. The data vectors in Part II 
contain information about: 

(1) the size of the grid, 

(2) the number of grid variables (the 
number of variables varies from problem to 
problem), 

(3) the order of the grid variables, 

(4) the attributes of the grid variables 
(i.e., nodal, zonal, etc.), 

(5) the boundary conditions, 

(6) the slide-line surfaces, and 

(7) the equations-of-state. 

Part III contains the grid variables themselves. 


Part I      512 words long and contains the 
            length of Part II 

Part II     Descriptor tree structures: 
            (1) Grid description 
            (2) Boundary description 
            (3) Slide surface description 
            (4) Equations-of-state description 
            (5) Data for structures 

Part III    Grid data 

Fig. 8. HEMP data file. 


Appendix D. APL Interpreter 


LLL's APL interpreter was used heavily 
during the design and algorithmic development 
phases of vector HEMP programming. Many 
data manipulation concepts were checked out in 
APL. The design of the very complex slide-line 
algorithms was particularly aided through its 
use. Without APL this would have been a much 
more difficult task. The value of APL is due 
primarily to three things: 

(1) its interactivity, 

(2) the fact that it includes in its operation 
set most of the vector operators, and 

(3) its extensive debug features [11] - [13]. 


Appendix E. Programming Language Extensions 


The source deck for the HEMP program is 
written in LRLTRAN [14], [15]. LRLTRAN is 
a superset of FORTRAN IV, with 
scalar and vector extensions. 
The scalar extensions most used were: 

(1) The .LOC. statement. Given 
J = .LOC.X, J would contain the absolute 
core location of the variable X. 

(2) The PARAMETER statement. Given 
PARAMETER (LWDB = 60), all occurrences 
of the name LWDB in the source deck would be 
replaced by the literal 60. 

(3) A MACRO processor. We use only the 
character substitution part of the MACRO proc- 
essor. 


Vector Language Extensions 


We used the following vector extensions in 
LRLTRAN: 

(1) VECTOR (DV1, V1): declares V1 to be 
a vector and DV1 to be the descriptor of vector 
V1. 

(2) BIT B1 VECTOR (DB1, B1): declares 
B1 to be a bit vector and DB1 to be the de- 
scriptor of B1. 

(3) CALL Q8CMPRS: generates code for 
the vector compress instruction. 

CALL Q8MERGE: generates code for 
the vector merge instruction. 

CALL Q8XPND: generates code for the 
vector expand instruction. 

CALL Q8MASK: generates code for the 
vector mask instruction. 

(4) The .CTRL. operator. V1 = B1.CTRL. 
V2 says store V2 into V1, under the control of 
bit vector B1. 

(5) CALL Q8INLINE(op-code, argument 
list for the op-code). Op-code is the STAR-100 
hexadecimal operation code, and the argument 
list must match the fields for the operation as 
defined in the STAR-100 reference manual [6]. 

The compiler generates inline coding for 
the STAR-100 for the vector operations. For 
the 7600, the compiler produces calls to soft- 
ware kernels for vector operations. The source 
deck for the HEMP program contains only dyadic 
expressions. This was done primarily to 
minimize allocation of scratch vector space for 
complicated equations. 
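
To make the bit-controlled operations concrete, the following Python sketch mimics three of them on ordinary lists. It is an illustration only: the function names are hypothetical, the compress and merge behaviour shown is the conventional meaning of those terms and is assumed rather than taken from the STAR-100 reference manual [6], and bit vectors are modelled as lists of 0 and 1. 

```python
def ctrl_store(v1, b1, v2):
    """V1 = B1.CTRL.V2: store V2 into V1 only where the control bit is 1."""
    return [new if bit else old for old, bit, new in zip(v1, b1, v2)]


def compress(v, bits):
    """Keep the elements of v whose control bit is 1, packed together."""
    return [x for x, bit in zip(v, bits) if bit]


def merge(a, b, bits):
    """Take the next element of a on a 1 bit and of b on a 0 bit."""
    ia, ib = iter(a), iter(b)
    return [next(ia) if bit else next(ib) for bit in bits]


v1 = [0.0, 0.0, 0.0, 0.0]
v2 = [1.0, 2.0, 3.0, 4.0]
b1 = [1, 0, 1, 0]

stored = ctrl_store(v1, b1, v2)                # elements 0 and 2 replaced
packed = compress(v2, b1)                      # elements selected by b1
others = compress(v2, [1 - bit for bit in b1])  # the remaining elements
restored = merge(packed, others, b1)           # original order recovered
```

The last line shows why compress and merge are used as a pair: a subset can be packed, processed as a short dense vector, and then put back in place under the same bit vector. 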


Appendix F. Vector Debugging Routines 


When debugging serial programs, octal (or 
hexadecimal) and/or decimal dumps are suf- 
ficient. Vector programs require more sophis- 
ticated dumping procedures. We wrote a sub- 
routine, VDUMP, to print "snapshots" of core 
while running a problem and a utility routine, 
VDUMP, to do post-mortem dumps. Both 
routines will dump in the following formats: 

(1) bit (pure binary, ones and zeroes) 

(2) ASCII 

(3) descriptors (on the 7600, they print the 
octal word address as well as the octal bit base 
address, and the length field in base 10) 

(4) floating point 

(5) hexadecimal 

(6) integer (base 10) 

(7) octal 

The routines will also dump vectors of all of the 
above formats. When printing a vector, the 
routines always print the descriptor first. 

Subroutine VDUMP will also trace a de- 
scriptor tree structure, printing all intermediate 
descriptors and the data vector at the end of each 
branch. The type of data at the end of a branch 
is determined by the subroutine and formatted 
accordingly. 
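
The tracing behaviour can be pictured with the short Python sketch below. It is not the VDUMP routine itself: the Descriptor class, its fields, and the rule used to pick a format (integer versus floating point) are assumptions made purely for illustration. 

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """Hypothetical stand-in for a descriptor in a tree structure."""
    name: str
    children: list = field(default_factory=list)  # lower-level descriptors
    data: list = None                              # data vector ending a branch


def trace(desc, depth=0, out=None):
    """Collect one line per descriptor, then the data at the end of a branch."""
    out = [] if out is None else out
    out.append("  " * depth + f"descriptor {desc.name}")
    if desc.data is not None:
        # Decide the print format from the data itself (assumed rule).
        is_int = all(isinstance(x, int) for x in desc.data)
        kind = "integer" if is_int else "floating point"
        out.append("  " * (depth + 1) + f"{kind} data {desc.data}")
    for child in desc.children:
        trace(child, depth + 1, out)
    return out


leaf_a = Descriptor("grid", data=[1, 2, 3])
leaf_b = Descriptor("boundary", data=[0.5, 1.5])
root = Descriptor("level0",
                  children=[Descriptor("level1", children=[leaf_a, leaf_b])])
lines = trace(root)
```

Printing every intermediate descriptor before the data mirrors the rule that the dump routines always print the descriptor first. 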

Utility routine VDUMP was written using 
character vector techniques. It executes about 
three times faster than our serial dump routine. 


Appendix G. Vector Kernels 


In evaluating large-scale parallel com- 
puters we reached the conclusion that they all 
could be considered to be "pipe-line" computers. 
This makes it possible to emulate a sequence of 
arithmetic and/or logical computer operations. 
Efficient subprograms called kernels can be 
written if a computer has: 

(1) a reasonable number of registers, 

(2) several arithmetic units that can be run 
in parallel, and 

(3) partitioned memory, so that multiple 
memory references can be made at the same 
time. 

The 7600 lends itself to the vector kernel con- 
cept. During the design phase of the HEMP pro- 
gram, two coworkers (Frank McMahon and 
Lansing Sloan) were programming 7600 vector 
routines to improve execution speed of FORTRAN 
programs [16], [17]. We had already developed 
and simulated similar vector kernels for the 
ILLIAC IV in 1970. Coordinating with McMahon, 
we decided to emulate a subset of the STAR-100's 
arithmetic and bit-byte instruction set for the 
7600. When using these vector kernels 
(labeled "in-stack loops" or simply "stack- 
loops") exclusively, we have what we call a 
vector 7600. These stack loops are mostly 
dyadic operations (V1 = V2.op.V3), but some are 
triadic (V1 = [V2.op.V3].op.V4), where V1, V2, 
V3, V4 are all vectors. Dyadic operations on 
the 7600 achieve around seven floating-point re- 
sults per microsecond, while triadic operations 
attain around ten floating-point results per micro- 
second. Vector execution rates are a function of 
the item count of the vector operations and the 
look-ahead techniques used to achieve complete 
concurrent CPU utilization. The stack-loops, 
like the STAR-100 vector instructions, require a 
fixed amount of start-up time. This start-up 
time becomes negligible for vectors of lengths 
greater than 400 operands. 

Table I compares the results per micro- 
second of the 7600 stack-loops, normal 
FORTRAN, and the STAR-100. 


Table I 

                              Results per Microsecond 
Process                       7600          STAR-100 

Unoptimized FORTRAN           1.2 - 1.9     9 
Optimized FORTRAN             1.6 - 3.3     ? 

Vector Operations (Dyads) 
  Transmit                    15            50 
  Arithmetic (+, -, *, /)     2 - 10        12.5 - 50 
  Compress                    5 - 100       25 
  Merge                       4             25 
  Boolean string              100 - 400     400 
  Transmit index list         7             4 

Vector Operations (Triads)    Products per Microsecond 
  (V1 * V2 * V3)              10            25 
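
The effect of a fixed start-up time on short vectors can be illustrated with a simple model in which the time for an n-operand operation is the start-up time plus n divided by the peak rate. This linear model is an assumption, not something stated in the text, and the start-up value of one microsecond used below is purely hypothetical; only the peak rate of about seven results per microsecond comes from the discussion above. 

```python
def effective_rate(n, peak_rate, startup_us):
    """Results per microsecond for a vector of n operands.

    Assumed model: total time = startup_us + n / peak_rate.
    """
    return n / (startup_us + n / peak_rate)


PEAK = 7.0      # dyadic results per microsecond on the 7600 (from the text)
STARTUP = 1.0   # hypothetical start-up time in microseconds

short_rate = effective_rate(10, PEAK, STARTUP)
long_rate = effective_rate(400, PEAK, STARTUP)
```

Under these assumed numbers a 10-operand vector runs well below the peak rate, while a 400-operand vector comes within a few percent of it, which is the sense in which the start-up time becomes negligible. 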


Appendix H. Timing of Vector HEMP vs. 
Serial HEMP 


At present the HEMP program is running 
on the 7600 using the vector stack-loops. To 
date, timing comparisons show that the vector 
HEMP program executes 2.2 times faster than 
the serial FORTRAN HEMP program (Fig. 9). 
With additional programming improvements and 
the use of the vectorized editing routines, 
throughput factors of three are predicted. The 
approximate numbers of vector operations per- 
formed per pass through the HEMP equations 
are: 

(1) 950 arithmetic operations (including 
simple data transfers), 

(2) 200 full-word logic operations (com- 
pare, compress, merge, etc.), and 

(3) 100 bit-string operations (bit and byte). 
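
The speed ratios implied by the points-per-millisecond figures in Fig. 9 can be checked with a few lines of Python. The dictionary layout and variable names are illustrative only; the numbers are those listed in the figure, keyed by number of grid points. 

```python
# Points processed per millisecond, taken from Fig. 9.
fig9 = {
    50:   {"serial": 4.45, "vector": 2.75},
    400:  {"serial": 4.25, "vector": 8.50},
    1600: {"serial": 4.20, "vector": 9.30},
}

# Ratio of vector to serial processing rate for each problem size.
speedup = {n: round(rates["vector"] / rates["serial"], 2)
           for n, rates in fig9.items()}
```

The quoted factor of 2.2 appears to correspond to the largest problem (1600 grid points); for the 50-point problem the vector program is slower than the serial one, consistent with the start-up cost of vector operations on short vectors. 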

Appendix I. Spanning Computers 

None of the vector language extensions 
appear in the HEMP source deck. All vector 
operations and descriptor manipulations are 
buried in simple macros. We have different 
macro files for the STAR-100 and the 7600. 
Separate macro files are needed because: 

(1) there are differences between the for- 
mats of STAR-100 and 7600 descriptors; 

(2) different PARAMETERS are used; 

(3) some operations require multiple 
vector instructions on the STAR-100, whereas on 
the 7600 a subroutine is called. 

Another reason for limiting ourselves to 
dyadic vector expressions is the simplicity of 
moving the HEMP program to computers other 
than the STAR-100 and the 7600. A relatively 
simple preprocessor would handle the macro 
expansions and the parameter substitutions. The 
resulting deck would then be FORTRAN IV- 
compatible (i.e., calls to software kernels would 
be done during preprocessing). 

The use of vectors on a machine like the 
ILLIAC IV eliminates the necessity of memory 
management techniques such as "skewing" for 
optimum PE usage and boundary condition cal- 
culations. The use of vectors also results in 
very little wasted memory, since the memory is 
packed. For someone who is accustomed to 
sequential (serial) programming, vector pro- 
gramming presents new challenges. However, 
our experience at LLL shows that if the equations 
of a model are appropriate to the use of vectors, 
they can be programmed in a straightforward 
manner. 


[Graph: points processed per millisecond versus 
length of vectors in thousands of words (0.2 to 1.6), 
for the vector program and the serial program.] 

            Points per     Time for     Number of 
            millisecond    cycle        grid points 
Serial      4.45           11           50 
            4.25           94           400 
            4.20           399          1600 
Vector      2.75           18           50 
            8.50           47           400 
            9.30           180          1600 

Fig. 9. Timing comparison of vector-vs.-serial 
7600 HEMP program. 
A PARALLEL ASSEMBLER FOR ILLIAC IV 


J. M. Randal 
Computing Services Office 
University of Illinois 
Urbana, Illinois 61801 


Summary 


One of the difficulties envisioned in running 
a computer of the power of ILLIAC IV is that of 
keeping it adequately supplied with a stream of 
ready-to-run jobs. This paper reports on the 
progress made in providing an assembler, 
compatible with one already provided and running 
on a Burroughs B6700, that runs on ILLIAC IV. 
Through detailed timing and functional simulation 
an assembler has been produced which assembles 
correctly executable object code at a rate of at 
least 300,000 cards a minute, virtually replacing 
the need for an "assemble-load-and-go" phase by an 
"assemble-and-go" phase 100 times faster. From 
the user's point of view the parallel assembler 
is indistinguishable from the existing serial 
one, even though its functions are spread over 
two machines. The paper catalogues other reasons 
for undertaking the project. Two principal ap- 
proaches that enhance parallelism in an assembly 
process, that of arranging the source code in 
the machine so that it is most amenable to paral- 
lel attack, and the delaying of as much semantic 
analysis as possible as long as possible, are 
outlined. The paper goes on to describe how 
parallelism is achieved for each stage of the 
assembly process, and the measured amounts of 
parallelism are compared and discussed. The 
paper concludes with a few observations on the 
practicality of parallel compilation of higher 
level languages and other so-called "inherently 
serial" processes. 
PROCESS COMMUNICATION PREREQUISITES OR THE IPC-SETUP REVISITED 


Michael J. Spier 
Software Engineering Department, 
Digital Equipment Corporation, 
146 Main Street, 
Maynard, Massachusetts 01754, USA 


Abstract -- A careful examination of any ex- 
isting inter-process communication (IPC) mechanism 
invariably uncovers the underlying existence of a 
more fundamental IPC mechanism, which in turn is 
built on a yet more fundamental IPC mechanism... 
etc. 


This study resolves this indefinite recursion 
of a self defining mechanism by proposing a certain 
causality, expressed in terms of a finite list of 
process communication prerequisites, and based on 
a non-mechanistic postulate which calls for an area 
of communication (or mailbox) that is by its very 
nature impervious to mutual interference by the 
communicating processes. 


Given arbitrary processes for which these pre- 
requisites hold, we may logically construct the 
"very first" elementary IPC mechanism, i.e., the 
one which is not dependent upon its own pre-exist- 
ence. Such a mechanism is developed in this paper; 
it is capable of transmitting a single, one-way, 
one-bit message among processes. 


It is suggested that the proposed causality, 
although arbitrary in many ways (and openly admit- 
ted as such) may serve as a convenient intellectual 
tool with which autonomous sequential processes may 
be observed and studied. 


Keywords: inter-process communication, IPC, IPC- 
Setup, mailbox, mutual exclusion, pro- 
cess, synchronization. 


CR Categories: 1.3, 4.32 


INTRODUCTION 


The Inter-Process Communication Setup (or IPC- 
Setup for short (a)) is an initial communication 
which establishes the conventions by which two or 
more asynchronous sequential processes agree upon 
a pattern of harmoniously cooperative behavior. 


The concept has been introduced in a previous 
paper [11] where it was incidental to the main 
subject. Subsequent reflection has convinced me 
that this concept deserves a much more thorough 
investigation. I have observed that the imple- 
mentability of any given inter-process communica- 
tion (IPC) mechanism is contingent on the previous 
availability of a more fundamental IPC mechanism 
(e.g., in order to implement a producer/consumer 
buffered communication mechanism [5] we need some 
mutual exclusion functions such as P and V [7], 
which in turn require a mechanism to guarantee 
their internal indivisibility in time, which in 
turn.....etc.) The recursion seems indefinite. 


(a) The term IPC-Setup was originally coined by 
Elliott I. Organick. 


Consider the following causality, which dis- 
plays a most perplexing dilemma. In order for two 
processes to synchronize themselves (e.g., using 
Dijkstra's P/V functions) they must have had some 
previous communication to establish the semaphore's 
identity as well as their agreement to make proper 
use of the synchronizing primitives. Thus, 


- In order for processes to communicate, they 
must synchronize themselves, 

- In order to synchronize themselves, they must 
have had an earlier communication, 

- Which implies a yet earlier act of synchroniza- 
tion, 

- Which had to be based on a yet earlier act 
of communication, 

- Which.... 


That the dilemma is not practically insurmount- 
able is amply demonstrated by the various func- 
tional IPC mechanisms that we know of. Evidently, 
at some basic level (typically the hardware level) 
the dilemma was resolved through an arbitrary act 
of Gordian-knot cutting (typically hardware- 
provided mutual exclusion). Experience has shown, 
however, that whenever the nature of processes 
changes (e.g., by the transition into virtual 
time) lower level synchronization machinery may 
no longer be valid. When we attempt to design a 
multi-level processing system, with nested levels 
of (virtual) parallelism where each successive 
act of (virtual) processor multiplexing increas- 
ingly removes us from our hardware base, it is we 
who have to provide the Gordian-knot cutting serv- 
ice at appropriate levels of implementation. As 
we implement successive layers of abstraction, the 
complexity of our underlying machinery increases. 
Whereas at the hardware level of a uni-processing 
computer we achieve the desired mutual exclusion 
through the simple act of interrupt inhibition, 
at a much higher level of implementation we may 
have to consider the properties of virtual proc- 
essors, the effects of invisible page fault inter- 
ruptions, the effects of an externally generated 
user interrupt (when an interactive user presses 
the attention key), etc. 


Also, to deal with two mechanisms which are 
defined in terms of one another is intellectually 
very frustrating. We may have to accept the 
"chicken or egg" dilemma when confronted with the 
real universe; we may wish, however, to have firmer 
intellectual control over our artifacts (e.g., 
computers, computer processes), at least in the 
sense of establishing a certain causality whose 
fundamental postulates are external to the observed 
mechanisms. It is the purpose of this study to 
suggest such a causality, in the form of a list of 
conditions which have to be true in order to guar- 
antee n arbitrary processes (m senders and n-m 
receivers) the ability to exchange a single, one- 
way, one-bit communication. This intellectual 
exercise has one ground rule: no pre-existing 
underlying mechanism is admitted, lest it contain 
a hidden IPC mechanism and thus leave us no further 
advanced than before. Therefore, I shall discuss 
implementation-independent abstract processes. 
A valid question to be raised is: "..why 
worry about processes which are external to the 
computer?" As Naur [10] points out, we are crea- 
tures of habit and have the inherent tendency to 
visualize concepts in those terms with which we 
are most familiar. Being computer professionals, 
we intuitively think of process in the context of 
executing computer program, it being implicitly 
understood that computer translates to "hardware 
level machine". As operating systems become more 
sophisticated and the hardware base hidden by 
intermediary levels of abstraction, our earlier 
simplistic notion of "process" may no longer hold, 
indeed become an intellectual impediment. Any 
insight gained into the properties of the 
implementation-independent abstract process will 
however hold true. 


In the following unconventional view of non- 
computer processes, I have guided (indeed biased) 
the development towards those kinds of processes 
with which we deal within the confines of the 
sequential digital computer, and added computer- 
derived examples to illustrate specific points. 


WHAT IS A PROCESS? 


Webster's Dictionary succinctly defines the 
term "process" as "Something going on". By 
selectively narrowing down our choices from this 
initial vague definition, we can derive an accept- 
able definition of "process" as it applies to our 
field of interest. 


Let us think of process as being the manifes- 
tation of Time, in Space. The universe in which 
we exist is subject to the Flow of Time so that 
it presents itself under different configurations 
at different points in Time. I apply the term 
"process" to some time-dependent evolution from 
one configuration to another. We might visualize 
the universal set of processes as "threads" of 
"control" indefinitely stretching from the past 
into the future, hopelessly intertwined beyond 
human comprehension. 


In order to make sense out of them, study and 
even manipulate them (e.g., within the confines of 
a sequential digital computer), we must selectively 
choose -- among the universal processes -- those 
specific evolutions in Time ("threads") which we 
deem worthy of consideration. Thus, I choose to 
declare "process" to be a subjective quality, 
existing only in the eyes of the observer, who 
explicitly ignores all other peripheral "threads" 
in order to avoid confusion. 


Examples of such humanly selective observations 
may range from the macroscopic level, exemplified 
by the Astronomer contemplating the birth-and-death 
process of suns and galaxies (or even the, to us, 
ultimate process of the universe's expansion and 
contraction), through the Historian tracing the 
evolution of Mankind or the lifecycle of nations, 
down to the microscopic level of the Quantum 
Physicist observing the incredibly short lifespan 
of some sub-atomic particle. 


An observer has to choose for himself not only 
that one specific "thread" which is of interest to 
him, but also the interval in Time between two 
successive observations (named "grain of time" in 
[8],[11]). I attribute the necessity for a sub- 
jective choice of intervals to the human brain's 
limited capacity to assimilate details, and I 
suggest that there exists a certain "Subjective 
Time Flow" within our minds, in terms of which 
sequential processes are best visualized. 


We may assume that the human brain cannot make 
sense out of a visualized process if that process 
consists of too many discrete details, and that 
for the sake of coherency the subjective process 
contains only a limited number of them. Thus, 
when a human observer translates an evolution in 
Real Time into an evolution in Subjective Time, 
he typically chooses intervals between observa- 
tions which are proportionate to the observed 
process's period of existence. 


And effectively, the Astronomer chooses his 
interval in terms of billions of years, while the 
Physicist's interval may be expressed in terms of 
billionths of a billionth of a second. Yet, 
within the minds of both these observers, their 
respective processes may unwind at the same sub- 
jective rate of speed, covering a similar number 
of discrete observations, and may abstractly be 
related to one another. 


Given the periodic nature of observations, the 
process can no longer be made literally analogous 
to a continuous thread; rather, it is better rep- 
resented as a discrete sequence of dots which are 
laid along the axis of the imaginary thread, where 
each dot corresponds to an observation and where 
spaces separating the dots correspond to the time 
intervals between successive observations. 


The human observer typically chooses to ignore 
the existence of the intervals, which to him are 
irrelevant, and to pretend that the dots are 
effectively adjacent to one another. Consequently, 
the discrete sequence of observations may artifi- 
cially coalesce once more into a humanly coherent, 
subjectively unbroken thread. By eliminating the 
real time intervals, we effect the translation of 
the process's evolution into the flow of Subjective 
Time. Compare this to Dijkstra's notion of 
"ordered markers on a scaleless time axis" [5]. 


Relating the above to our area of interest, 
namely the study of those processes which exist 
within digital computers, we see that the notion 
of "process" is still a subjective quality, depen- 
dent on the human observer's choice. A process 
may, for example, consist of some high level lan- 
guage (e.g., FORTRAN, ALGOL) program where obser- 
vations relate to source language variables and 
where the interval of time between two successive 
observations is known to span one or more hardware 
cycles, while from the (interactive or batch) 
user's point of view a process may consist of a 
series of system commands, the interval between 
which may comprise one or more high-level language 
programs. 


Lastly, a process may be sequential or non- 
sequential (b). Briefly, the former denotes a 
process in which the interval between two succes- 
sive observations is assumed to consist of a 
single logical evolutionary step, while the latter 
denotes a process in which the interval between 
two successive observations is assumed to consist 
of a compound logical evolution. The difference 
between the two is largely subjective and I believe 
it is safe to state that the non-sequential process 
has the property that any of its changes of state 
may be decomposed into a number of parallel sequen- 
tial processes. 


(b) In "Cooperative Sequential Processes" [5], 
Dijkstra illustrates the distinction between 
sequential and non-sequential processes. 


In this paper I adopt the point of view that 
the sequential process is the "elementary" kind of 
process. I shall henceforth ignore the 
non-sequential one by simply choosing to observe 
my processes at those points of their evolution 
where they display a single logical change of 
state (concerning this arbitrary choice of perspec- 
tive, the reader is referred to observation #1 
further on). This choice coincides with our pro- 
fessional custom to consider the computer's fetch- 
decode-execute cycle as a truly sequential progres- 
sion, even though it might consist internally of 
two parallel overlapping "execute current instruc- 
tion while fetching the next one" operations, or 
even though at a more elementary level the entire 
computer is known to be implemented as a highly 
complex parallel hardware logic. 


This paper, then, restricts itself to the 
study of observably sequential processes. 


PROCESS DEFINITION PARAMETERS 


For the purpose of this intellectual exercise, 
I wish to study processes which are known to exist. 
The following definitions apply to subjectively 
observed processes which may or may not inhabit 
the insides of a digital computer. Therefore I 
have chosen intentionally to ignore the processor 
stateword concept whose hardware-level definition 
is clear whereas its implementation-independent 
definition would not substantially add to our pre- 
sent abstract discussion. The following definition 
of the term process as it applies to a selectively 
observed "thread" is borrowed from a previous 
publication [11]. 


A process is a discrete progression, in Time, of 
discernible changing states. 


Though correct from the abstract point of view, 
this definition may not prove of great practicality 
when it comes to the consideration of computer 
processes. In order to relate it more closely to 
professional terminology, I introduce the term 
memory space (named "state variable set" in [8]) 
to denote the set of variables whose changing 
states may be observed: 


A memory space, subjected to the Flow of Time, 
presents ever changing configurations of discern- 
ible states. 


An additional helpful concept which allows us 
to attach somewhat more of a tangible substance to 
the "time" abstraction is that of the processor (c). 


(c) The term is used in its most abstract conno- 
tation, and must not be taken literally in 
the meaning of "hardware CPU". 
The processor is an abstract "execution agent" 
(comparable to Johnston's "clerk" [9], and to 
Dennis' & Van Horn's "locus of control" [4]) which 
activates an ordered sequence of modifications on 
the various components of the memory space. If we 
can hypothesize a memory space which is unaffected 
by the Flow of Time (e.g., the memory of an in- 
active computer), we can define the processor as 
being: 


A catalyst capable of subjecting a memory space 
to the Flow of Time. 


With the help of these two terms, we may now 
devise a definition of "process" which is much 
closer to our professional terminology: 


A process is the activity of a processor
within a memory space. 


The memory space may assume various aspects, 
and depending upon its nature the contained vari- 
ables may be discretely identifiable, or not. By 
intentionally biasing the discussion towards the 
kind of processes in which we are interested, I 
shall postulate for our convenience that the vari- 
ables with which we deal are discretely addressable 
by means of universally unambiguous identifiers, 
or names. In the following, the term mailbox name
is used in the connotation of "universally unambig- 
uous identifier of a memory space component of the 
type mailbox". 


Returning once more to the universal processes, 
we can envision their flow in Time as individual 
intertwined threads, where one particular thread 
represents our process of choice. This thread 
reaches both backwards and forwards into infinity. 
It would be useful to delimit the extremities of
that portion of the thread which we actually hold
under observation. I would thus add the following 
two parameters to the definition of a process, 
these being its creation time and its termination 
time, corresponding to the extremities of the 
thread-portion pointing towards "past" and "future", 
respectively. 


For example, we may consider a human being, 
going through his daily routine, to be a sequential 
process. Evidently, his dates of birth and death 
are relevant parameters in the definition of such 
a process (if only to preclude any notion of the 
feasibility of communicating present-day computer 
science concepts to the late Charles Babbage). 


Following is a list of parameters applicable to 
a single observably sequential process. By assign- 
ing values to these parameters, we may talk more 
precisely about some specific process: 


Parameter #1: The intervals between successive
              change-of-state observations (d).


Parameter #2: A memory space comprised of all the
              variables which may be affected by
              the processor.


Parameter #3: A process creation time at which
              the combination processor/memory
              space becomes meaningful.


Parameter #4: A process termination time at
              which the combination processor/
              memory space ceases to be
              meaningful (e).


I shall also refer to the combined parameters
3,4 as the process's lifespan.


(d) Note that while the intervals need not be of
uniform size, their rough order of magnitude is
a relevant parameter.
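

The four parameters can be gathered into a small record. The following Python sketch is an illustration only; the names ObservedProcess, interval_magnitude, memory_space, creation_time and termination_time are assumptions introduced here, and the numeric values are arbitrary.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ObservedProcess:
    """Hypothetical record of the four process definition parameters."""
    interval_magnitude: float     # Parameter #1: rough order of magnitude of the intervals
    memory_space: FrozenSet[str]  # Parameter #2: names of the variables the processor may affect
    creation_time: float          # Parameter #3
    termination_time: float       # Parameter #4

    @property
    def lifespan(self) -> Tuple[float, float]:
        """Parameters 3 and 4 combined."""
        return (self.creation_time, self.termination_time)


# Arbitrary illustrative values; the time unit is left unspecified.
p = ObservedProcess(
    interval_magnitude=1e-6,
    memory_space=frozenset({"mailbox", "Z"}),
    creation_time=0.0,
    termination_time=10.0,
)
print(p.lifespan)
```

Assigning values in this way is all that "talking more precisely about some specific process" amounts to in the sketch; nothing here constructs or runs a process.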


Observation #1: I wish to emphasize the fact
that the previous definitions and parameters are
arbitrarily chosen in order to provide a useful
handle on the kind of processes in which we are
interested. From the absolute point of view,
both definition and parameters are highly ambig-
uous. Consider: the parameters relate to a
portion of a thread which is our chosen process.
From the larger thread's perspective, the above
"creation" and "termination" may be considered
to be changes of state where the "lifespan" in
between is considered to be the interval. Thus,
we may state that a process is a change of state
and that consequently a process is a discrete
progression, in Time, of processes. This phenom-
enon of recursive self-definition is a marked
property of the general area of discussion. By
considering the process from a conveniently
chosen subjective point of view, and by making
some well placed arbitrary definitions and pos-
tulates, we may gracefully extricate ourselves
from this "chicken or egg" situation, which per-
sistently manifests itself in the study of IPC
mechanisms. Compare this to the discussion of
"image processes" in [8].


INTER-PROCESS COMMUNICATION


Returning again to the universal processes, we 
may intuitively think of inter-process communica- 
tion as being an interaction of sorts between two 
or more threads. The term "communication" conveys 
the meaning of commonality, or togetherness. I 
postulate that processes cannot engage in communi-
cation unless they already have something in
common. I further postulate that such commonality 
must relate to the memory space component of the 
process. 


While arbitrary, the postulates make sense 
when we consider that the process consists of only 
two components, 1) the processor, and 2) the memory 
space. While commonality in Time (such as the 
co-existence of otherwise unrelated computers) does 
not -- in itself -- provide us with the ability to 
communicate, commonality in Space (such as connec- 
ting those computers to a common memory bank) 
definitely does. We can visualize the processes, 
communicating with one another by depositing 
messages in the common memory space and/or extrac- 
ting messages from it. Supposing that the memory 
space consists of a medium which lends itself well 
to the exchange of messages, we may state that: 


Processes communicate by exchanging messages
in a commonly accessible medium.


Even though I shall henceforth employ the term
mailbox (as suggested in [11]) to designate the
locality in which messages are exchanged, I have
expressly used the term "medium" in the definition,
in order to emphasize the rather large variety of
possible overlapping memory spaces. While the
point is very obvious in non-computer communica-
tions, e.g., one person talking to another (the
medium being the surrounding air), it applies as
well to certain less conventional instances of
communication in the computer world, such as the
radio link connecting the remote components of
the ALOHA system [1], or the IMP's and transcon-
tinental lines of the ARPA network [3], or simply
the tapes or disk-packs which may be manually
shuttled between independent computer installa-
tions.


(e) Note that it is the meaningful association which
determines the process's existence, and that the
disassociation implies the termination of
neither processor nor memory space.


Of the two process components, processor and 
memory space, I have chosen to dismiss the pro-
cessor as a possible vehicle for the elementary
commonality. Is such a dismissal justified? 
Would an exactly synchronized rate of progression 
not provide a suitable basis for the communica- 
tion of two spatially-independent processes? My 
answer is an emphatic no! Two such processes
which knowingly tick along at an exactly synchro-
nized rate may each perform a function based upon
the assumed concurrent activity of the other;
however, they do not communicate, because each acts
independently of the other's existence (i.e., one
such process may be terminated without affecting
the other's behavior; the survivor's activity
continues even though its premise of concurrency
no longer holds). Still, while a synchronous
rate of progression is not in itself sufficient to 
form the basis for a communication mechanism, it 
may be usefully applied to an IPC mechanism based 
on memory space commonality, as shown in the last 
section of this paper. 


I therefore consider that commonality of
memory space is the essential condition which has
to be satisfied if processes are to communicate
at all (f). Processes whose memory spaces are
exclusive are by definition incapable of mutual
communication, in fact are said to be protected
from one another [12]!


(f) A question was raised about this last state-
ment, and critics argued that in systems such
as the RC4000 [2] processes communicate not
through a shared data base but rather through
the intermediary services of the system
monitor. I wish to point out that the present
study does not concern itself with the more
complex ways of building IPC mechanisms, but
only with the prerequisites for the most
primitive "first" communication. Moreover,
the fact that a user process's memory space
consists in part of a protected portion of the
monitor does not invalidate the fact that
within that monitor there is a shared data
base in which messages are exchanged; remove
that message buffer from the RC4000 system,
and inter-process communication is guaranteed
to work no more.


COHERENT COMMUNICATION


Our processes communicate by exchanging mes-
sages in a commonly accessible mailbox. Depending
on whether or not message depositions and extrac-
tions happen concurrently (remember, at this point
we know nothing specific about these processes
and their pattern of behavior, excepting the fact
that they share a commonly accessible mailbox),
such communication may be coherent or interfering.
The communication is said to be interfering if the 
value extracted from the mailbox by some receiving 
process B is not identical to the same value pre- 
viously deposited in the mailbox by some sending 
process A. Interference (and the resulting message 
incongruity) may occur when several depositions are 
made concurrently, or when extraction is attempted 
while deposition is still under way. The commu- 
nication is said to be coherent when it is not 
interfering. 


Coherent communication is characterized by the 
fact that a message extracted from a mailbox is 
guaranteed identical to the same message previously 
deposited in the mailbox. We may not know who 
deposited that message in the mailbox, nor what 
it means, but we are assured of the congruity of 
that message. 


It is the coherent message which is of interest 
to us. A way to assure coherency must definitely 
be an important constituent of the process commu- 
nication prerequisites that we seek. I shall 
therefore further postulate that the mailbox 
itself, by its nature, possesses a property of 
guaranteed message congruity such that whenever 
two or more processes simultaneously attempt to 
either deposit a message in it, or extract a mes- 
sage from it, only one process at a time will be 
allowed to do so; the exact succession into which
this enforced sequentiality will be resolved is
undefined and immaterial (g). Not knowing who
created the mailbox with its magical property, nor
how this property is functionally enforced, we can
only surmise that it is the handiwork of some 
benevolent instrumentality. Still, assuming the 
mailbox's availability, we may state that: 


Coherent inter-process communication is an
interference free exchange of messages in
a commonly accessible mailbox.


In the following, I shall refer to the mailbox 
as being "interference-proof". The interested
reader may wish to study the details of the ALOHA 
system [1], whose mailbox (i.e., a certain 
bandwidth of the electromagnetic spectrum) is not 
guaranteed to be coherent; ingenious encoding 
techniques reduce the probability of interference 
to a very low factor, but the fact remains that 
coherency is not guaranteed. 
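

The postulated property (only one process at a time may deposit or extract) can be imitated on a present-day machine. The Python sketch below is a hedged illustration: the class name InterferenceProofMailbox and its methods are hypothetical, and the lock supplied by the host language is simply assumed to play the part of the benevolent instrumentality; it is not a construction taken from the paper.

```python
import threading


class InterferenceProofMailbox:
    """Mailbox whose deposits and extractions are serialized by a lock.

    The lock is an assumption of this sketch: it stands in for the
    postulated property of guaranteed message congruity.
    """

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._parts = list(initial)

    def deposit(self, message):
        # A multi-part deposit is not indivisible by itself; the lock makes it so.
        with self._lock:
            self._parts.clear()
            for part in message:
                self._parts.append(part)

    def extract(self):
        with self._lock:
            return tuple(self._parts)


box = InterferenceProofMailbox(("a", "a", "a"))
seen = []


def sender(symbol):
    for _ in range(200):
        box.deposit((symbol, symbol, symbol))


def receiver():
    for _ in range(200):
        seen.append(box.extract())


threads = [threading.Thread(target=sender, args=(s,)) for s in "xy"]
threads.append(threading.Thread(target=receiver))
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every extracted message is one that was deposited whole, never a mixture.
coherent = all(len(m) == 3 and len(set(m)) == 1 for m in seen)
print(coherent)
```

As in the text, the receiver learns nothing about who deposited a message or what it means; the sketch only checks congruity.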


Observation #2: Let us for the time being accept 
the premise of a mailbox which allows only ex- 
changes of coherent information, even though it 
is unclear how such a mailbox might be construc-
ted. Later on I shall 1) postulate a very ele-
mentary two-state mailbox whose implementability
will not be subject to doubt, and 2) suggest that
more elaborate mailboxes may be constructed with
the help of the elementary one.


(g) The necessity for such a mailbox (and its
magical property of coherency) is a fundamental
postulate of any multiple processor computer
system; e.g., at any given time, memory is
interlocked to all but a single processor.


MEANINGFUL COMMUNICATION


Let us attempt to construct an initial model 
of communicating processes. For the sake of sim- 
plicity, I shall deal with two processes only, a 
sender and a receiver. The following is, however, 
valid for any number m of sending processes, and 
n of receiving processes. 


At this point, all that we may assume about 
our processes are the characteristics discussed 
earlier; namely, their sequentiality and their 
memory spaces which overlap a commonly-accessible, 
interference-proof mailbox. Two processes named 
A and B communicate as follows: 1) the sender, 
process A, deposits a message Msg in the mailbox; 
2) the receiver, process B, copies the contents 
of the mailbox into some private locality Z. The 
sender would perform


     mailbox := Msg;


and the receiving operation would be


     Z := mailbox;


Even though the communication is coherent, it is
completely meaningless. Consider the following:


1) By what right can it be assumed that process
   A has ever had the intention of depositing
   anything whatsoever in the mailbox? Assuming
   that it did have such an intention,


2) Are processes A and B actually referring to the
   same mailbox? Is it not possible that process
   A innocently deposits its message in some
   mailbox, while process B persists in extracting
   an assumed message from some other mailbox?
   We may graciously submit that the mailbox is
   one and the same, still


3) Process B may be the speedy one, extracting an
   assumed message from the mailbox before the
   slower process A has ever had the chance to
   perform the intended deposition. And if we
   agree to discard this possibility as well, then


4) Having received its coherent message, process
   B is no further advanced because it has no way
   of knowing what the message is supposed to
   mean (h).


If we wish to engage in meaningful communica- 
tion, we have to make sure that the above uncer- 
tainties are satisfactorily resolved. We may not, 
at this point, have any specific remedy; this need 
not deter us from describing the effect of such a 
solution by establishing a list of conditions 
which are essential to meaningful communication: 


Condition #1: The processes have to agree in
advance (and that means prior to the creation
time of any of the communicants) on their inten-
tion to communicate some time in the future.
Remember, we still choose to remain in a state
of blissful ignorance concerning these processes,
thus the pre-natal instance is the only logically-
safe point in time.


Condition #2: They must agree on the exact iden-
tity of the single mailbox in which messages will
be exchanged.


Condition #3: They must agree on their respective
sender/receiver roles.


Condition #4: Mandatory sequentiality has to be
imposed on the act of communication. First the
sender has to deposit his message, and only then
may the receiver extract it from the mailbox.


Condition #5: The communicating processes have
to have agreed, in advance, upon the way in
which messages are to be interpreted and under-
stood.


(h) We may better appreciate this fact if we con-
sider the task of the military cryptographer,
faced with the decoding of an intercepted
coherent enemy message; he is capable of success
because he knows the other guy's language.
Process B's task is hopeless, it knows absolutely
nothing about process A. Consider the hopeless-
ness of deciphering Egyptian hieroglyphs prior
to the Rosetta Stone Discovery.


PROCESS SYNCHRONIZATION 


In the above list, condition #4 requires that 
the communicating processes adjust their relative 
speeds; as they progress independently in Time, 
when their respective instances of communication 
arrive, these instances have to become aligned in 
Time in a predetermined way. We use the term 
"synchronization" to denote such an alignment. 


We still know nothing specific about these
processes, hence cannot trivially choose between
alternate schemes of synchronization which may all
seem a priori to be equally attractive. Possibil-
ities may include 1) the sender having the ability
to slow down the receiver's progression in Time,
2) the receiver having the ability to cause the
sender's speed to be accelerated, etc.


A simple, though arbitrarily chosen, scheme to
assure that message extraction will happen later
in Time than message deposition would have the
receiver process voluntarily enter a waiting state
if the message has not yet arrived. This method
is chosen because it lends itself best to the
kind of process synchronization practiced within
digital computers, and is hence typical of existing
computer program IPC mechanisms. Its adoption
requires that we add two more conditions to our
list.


Condition #6: The receiving process is capable 
of determining at any given moment whether or 
not a message had actually been deposited in the 
mailbox. 


Condition #7: If a message had not yet been de- 
posited, the receiver must be willing, and cap- 
able (!), of suspending its progression until 
such time when the message has arrived. 


This introduces one last complication. Condi-
tion #6 calls for the process's ability to inspect
the mailbox's contents and determine whether or
not a message had arrived. Presumably it will do
so by testing the mailbox for some specific value
which may be either a non-message or a message
value. What can that value be? If the receiver
tests for a non-message value, is it not possible
that the sender has innocently used that very same
value for its message and thereby misled the
receiver? Or if the receiver tests for a message
value, is it not possible that the mailbox
might -- by some unfortunate chance -- have been
pre-initialized to that very same value thus mis-
leading the speedier receiver into acceptance of
a supposed communication, when in fact no such
transaction has yet taken place? We must there-
fore complete our list of conditions with the
following two:


Condition #8: The communicating processes have
to have agreed on a single non-message value
Vinit to be interpreted as "no message has yet
arrived" (by agreeing on a non-message value,
we leave the door open for a possible variety
of meaningful message values).


Condition #9: The mailbox is guaranteed to have
been initialized to the non-message value Vinit
prior to the creation time of any of the commu-
nicating processes (again, within the present
context of discussion, this is the only
logically defensible point in time).
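

Conditions #6, #8 and #9 can be illustrated without any concurrency at all. In the Python sketch below, V_INIT and message_arrived are hypothetical names chosen for the illustration; the sketch shows the receiver's test and the ambiguity that arises when the sender innocently uses the non-message value as its message.

```python
V_INIT = object()   # hypothetical non-message value, agreed upon in advance (Condition #8)


def message_arrived(mailbox_value, v_init=V_INIT):
    """Condition #6: decide whether a message has been deposited."""
    return mailbox_value is not v_init


# Condition #9: the mailbox is pre-set before any process is created.
mailbox = V_INIT
before = message_arrived(mailbox)

# The sender's deposition of an ordinary message value.
mailbox = "Msg"
after = message_arrived(mailbox)

# If the sender uses the non-message value as its message,
# the receiver cannot tell that anything was sent.
mailbox = V_INIT
ambiguous = message_arrived(mailbox)

print(before, after, ambiguous)
```

Reserving a single non-message value, rather than a single message value, leaves every other value available to carry meaning, which is the point made in Condition #8.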


PROCESS COMPATIBILITY 


Having established the need for process syn- 
chronization, we must preclude from our consider- 
ation those processes which are -- by virtue of 
their temporal characteristics -- inherently in- 
compatible with one another from a synchronization 
point of view. Of the process definition param-
eters, the interval and the lifespan may assume
values which would make the processes incapable 
of meaningful synchronization. I wish to remind 
the reader that this paper does not engage in 
the exercise of process construction, but in the 
observation of already existing processes. Thus, 
the three incompatibilities listed below are valid 
so long as we recognize our inability to influence 
the processes' temporal parameters (i.e., we pre- 
clude from our consideration artifacts such as 
"clocking processes" [8]).


Incompatibility #1: The processes' lifespans 
may be exclusive; one process's termination
time may have passed well before the other
process's creation time has yet arrived. This 
case was exemplified by the earlier mentioning 
of Charles Babbage. This condition is asymmet- 
rical in that the expiration of the sending 
process might still be acceptable (i.e., Charles 
Babbage effectively did leave a message for 
posterity) whereas the premature expiration of 
the receiver is obviously inadmissible. If we 
postulate that the elementary communication 
mechanism that we seek should be indifferently 
functional for any sender/receiver configuration 
(i.e., should allow any two or more processes to 
adopt either role) then we have to insist that 
the processes' lifespans overlap the period of
communication. 


Incompatibility #2: The processes' lifespans 
may overlap the period of communication, but 
only partially so that the sender's termination 
time arrives before it has had the opportunity 
to properly conclude its part of the transac- 
tion. This may cause the receiving process 
indefinitely to suspend its progression in
anticipation of a message whose deposition was
never satisfactorily carried out. For processes 
to engage in guaranteed non-fatal meaningful 
communication, the sender's termination time 
must lie well outside the period of communica- 
tion, known as the critical section in the 
process' lifespan [5]. 


Incompatibility #3: The size or relative order
of magnitude of the processes' respective inter-
vals must be compatible. It is difficult to be
very specific about the exact kind of interval
compatibility that is desired; the reader must
have noticed by now that the main thesis of
this paper consists of emphasizing the very
nebulous nature of the overall subject.


Nonetheless, this is a very real problem best
exemplified by the inability of a virtual pro-
cessor, executing within a paged virtual memory,
to correctly service real-time applications.
The interval between two successive virtual
machine cycles is undetermined, while the corre-
spondent real-time process requires guaranteed
service within specific time bounds.


ELEMENTARY COMMUNICATION MECHANISM 


Let us now construct the "first" and most ele-
mentary communication mechanism which would sat-
isfy all of the requirements mentioned earlier.
The processes are assumed to be inherently suitable
for mutual communication in the dual sense of over- 
lapping memory space and temporal compatibility. 
Concentrating on the communication mechanics alone, 
we are faced with one major difficulty which is 
the creation of the interference-proof mailbox. 


It is possible to construct a very primitive 
mailbox which has the capacity for a single bit of 
information only. The domain of the mailbox is 
thus restricted to two possible values which we 
shall name TRUE and FALSE. By nature of its de-
finition, the mailbox could never be found in a
state which is neither TRUE nor FALSE and it
therefore fulfills our requirement of inherent 
coherency. 


If we assume that such a mailbox was originally 
created by some benevolent instrumentality (i),
placed in the common memory space and thoughtfully 
initialized by the instrumentality to the FALSE 
state, then we may establish the following scheme 
for communication, where a sender process sets the 
mailbox to TRUE, and where the receiver process 
interprets the TRUE state as meaning "a message 
has arrived". 


Also, the receiver process would interpret the 
FALSE state of the mailbox as meaning "a message 
has not yet arrived". The receiver may now suspend 
its logical progression by insistently testing the
mailbox for a TRUE state. The mechanism would
work as follows: the sending operation corresponds
to


     mailbox := TRUE;


while the receiving operation is of the form


     busyloop:
          IF mailbox = FALSE THEN GOTO busyloop;


Observation #3: It is important to note that
while the receiving process's progression in
Time is by no means affected, we have achieved
the functionally desired effect by imposing on
that process a rule of behavior which guarantees
that its memory space is subjected to no further
modification while the mailbox = FALSE condition
prevails. As mentioned in the last section of
this paper, computer processes have the highly
interesting property that their Flow of Time
may be literally stopped and restarted.


(i) The computer hardware designer who provides us
with an interlocked memory, or even with
hardware implemented semaphores [7], is a
good example of what I would call "instrumen-
tality".
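

The one-bit scheme above can be tried out with two threads standing in for the sender and receiver processes. This Python sketch is an assumption-laden illustration: the names sender, receiver and log are hypothetical, the sleep calls are conveniences of the sketch and not part of the scheme, and it relies on the host interpreter storing a single Boolean indivisibly, which is what the two-state mailbox postulates.

```python
import threading
import time

mailbox = False      # pre-set to FALSE by the "instrumentality"
log = []


def sender():
    global mailbox
    time.sleep(0.01)         # let the sender be the slower process
    log.append("deposited")
    mailbox = True           # mailbox := TRUE;


def receiver():
    while mailbox is False:  # IF mailbox = FALSE THEN GOTO busyloop;
        time.sleep(0)        # yield to other threads; a convenience of this sketch
    log.append("extracted")


r = threading.Thread(target=receiver)
s = threading.Thread(target=sender)
r.start()
s.start()
r.join()
s.join()
print(log)
```

The receiver is started first and is the speedier of the two, yet its extraction is recorded only after the deposition, which is the sequentiality demanded by Condition #4.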


The above mechanism is the most rudimentary 
imaginable, capable only of a single one-bit one- 
way (or "Simplex") communication. By reciprocally 
using two mailboxes and by inverting the processes' 
sender/receiver roles, we may construct a mechanism 
capable of sending two single one-bit messages in 
opposite directions (known as a "duplex" communica-
tion channel). Combinatorial usage of many such
mechanisms allows us to construct a "multiplex"
channel, or a "bus" (parallel simplex channels)
as encountered in the innards of computers. The 
information transmission capability of the ele- 
mentary mechanism is very poor. Each mailbox may 
be used only once, and the existence of the mes- 
sage is also its value. We may detect the arrival 
of such a message, but may not transmit any addi- 
tional intelligence. 
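

A duplex arrangement built from two such one-bit mailboxes might look as follows. This is a minimal Python sketch under the same assumptions as any thread-based imitation of the scheme; the names a_to_b, b_to_a, wait_for and trace are hypothetical, and each mailbox is used exactly once.

```python
import threading
import time

a_to_b = False   # first one-bit mailbox, pre-set to FALSE
b_to_a = False   # second one-bit mailbox, pre-set to FALSE
trace = []


def wait_for(read_mailbox):
    """Busy-wait until the given mailbox reads TRUE."""
    while read_mailbox() is False:
        time.sleep(0)    # yield while waiting; a convenience of this sketch


def process_a():
    global a_to_b
    trace.append("A sends")
    a_to_b = True                 # A is the sender on the first mailbox
    wait_for(lambda: b_to_a)      # and the receiver on the second
    trace.append("A got reply")


def process_b():
    global b_to_a
    wait_for(lambda: a_to_b)      # B is the receiver on the first mailbox
    trace.append("B got message")
    b_to_a = True                 # and the sender on the second


threads = [threading.Thread(target=process_b),
           threading.Thread(target=process_a)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(trace)
```

Each direction still conveys nothing beyond the arrival of its single message; the roles are merely inverted on the second mailbox.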


I name the mechanism which allows us to trans-
act a single one-way one-bit communication the
elementary communication mechanism, and re-state
that its existence is contingent on the availa- 
bility of a magical interference-proof mailbox, 
provided (in a properly initialized state) by 
some benevolent instrumentality. If we do not 
accept the premise of such an initial mailbox, 
we may never be able to construct the very first 
IPC mechanism. 


MUTUAL EXCLUSION 


There is no point in elaborating the limited 
usefulness of the elementary communication mech- 
anism. Its significance lies in the fact that 
it might serve us as a building block for the 
construction of more useful, more sophisticated 
IPC mechanisms. For example, a useful mechanism 
-- such as the WAIT/NOTIFY functions suggested in 
[11] -- would be capable of a continuous sequen-
tial transmission of variable-length information-
laden messages, and would also have a buffering
effect minimizing the necessity for non-productive 
waits. We may visualize the communication-channel 
effect of such a mechanism in the form of a 
one-way information "pipeline"; the sender stuffs 
messages into one end, the receiver opens his 
faucet whenever necessity requires and draws 
information out of the other end. The realization 
of such a mechanism hinges on our ability to 
construct an interference-proof "pipeline"-type 
mailbox. 


Yet if we reconsider the meaning of
"interference-proof" we realize that all that is
necessary is the assurance that among N communi-
cating processes, N-1 would refrain from accessing
the mailbox while the Nth process is manipulating
it. Our primitive mailbox guarantees this by its
very nature; the same effect can be achieved for
any arbitrary component of the memory space if the
processes agree voluntarily to adopt such a pattern
of non-interfering behavior whenever the mailbox
is being accessed.


Such agreement should be semantically express-
ible. Let us postulate a pair of functional
mutual-exclusion brackets named MUTEX/XETUM (j).
Whenever a process intends to access the mailbox,
it announces the intention by performing a
MUTEX(mailbox). When it has finished manipulating
the mailbox it signals the mailbox's availability
by performing a XETUM(mailbox). The logic of
these functional brackets is such that at any
given time at most a single process will be manip-
ulating the mailbox.


Observation #4: The nature of our mailbox is now 
radically changed! While the elementary mailbox 
guaranteed coherency by its very nature no 
matter how the processes chose to access it, a 
mailbox whose coherency is achieved via the 
application of MUTEX/XETUM will remain 
interference-proof only as long as the communi- 
cating processes choose harmoniously to cooperate 
with one another. Let a single communicating 
process "do its own thing" and we are faced with 
an unbridgeable communication gap. 
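

The voluntary nature of the brackets can be made concrete. In the Python sketch below, BracketedMailbox, mutex and xetum are hypothetical names, a library lock stands in for the bracket logic, and the two "processes" are simulated in a single thread so that the outcome is deterministic; it is an illustration of the observation, not an implementation from the paper.

```python
import threading


class BracketedMailbox:
    """Mailbox protected only by voluntary MUTEX/XETUM brackets."""

    def __init__(self):
        self._lock = threading.Lock()   # stand-in for the bracket logic
        self.contents = []              # the mailbox proper, freely accessible

    def mutex(self):
        self._lock.acquire()

    def xetum(self):
        self._lock.release()


box = BracketedMailbox()

# A cooperating process brackets its two-part deposition.
box.mutex()
box.contents.append("first half")

# A non-cooperating process ignores the brackets; nothing stops it.
box.contents.append("intruder")

box.contents.append("second half")
box.xetum()

interfered = box.contents != ["first half", "second half"]
print(interfered)
```

The brackets constrain only those processes that perform them, which is why the mailbox remains interference-proof only under harmonious cooperation.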


And how would we manufacture these functional
mutual-exclusion brackets? Their nature implies
a whole new dimension of underlying communication
and cooperation among processes, and it might be
argued that it is foolhardy to re-invoke the
"chicken or egg" situation by proposing to solve
a problem through a mechanism which manifests the
same problem. We might have been forced arbitra-
rily to postulate the existence of MUTEX/XETUM
as we have done earlier. Luckily, in his "Solution
of a Problem in Concurrent Programming" [6],
Dijkstra has demonstrated that the availability of
an interference-proof mailbox is sufficient to
assure the implementability of MUTEX/XETUM (k).
And once we have constructed these mutual-exclusion
brackets, the road is clear to the construction of
mailboxes of arbitrary complexity and sophistica-
tion.


(j) The use of the inverted left bracket clause to
designate the right bracket is inspired by the
BLISS [13] systems programming language. The
name MUTEX, originally used by Dijkstra [5] to
designate a mutual exclusion semaphore, has for
some time been used by rank-and-file programmers
to designate the mutual-exclusion P [7] opera-
tion (possibly because of the confusion between
mutual-exclusion and private semaphores); it is
employed here post facto.


(k) Note that Dijkstra's mechanism requires, in ad-
dition to the coherent binary mailbox, a coher-
ent integer mailbox k. Disregarding the possi-
bility of modifying the algorithm to all-binary,
we can safely postulate the integer mailbox for
our purposes, because the hardware designer
knows how to build it out of binary mailboxes
(flip-flops).
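

For orientation, the brackets can be sketched along the lines of Dijkstra's solution as it is commonly presented: two Boolean arrays and one integer mailbox k. The Python below is a hedged illustration, not a transcription of [6]; the names mutex, xetum, worker, b, c and k are chosen for this sketch, and it assumes that the host interpreter performs each single element read or write indivisibly and in program order, which is the coherency the text postulates for mailboxes.

```python
import threading

N = 3
b = [True] * N   # b[i] is False while process i is competing or inside
c = [True] * N   # c[i] is False while process i claims the critical section
k = [0]          # the integer mailbox; any index in range(N) will do initially


def mutex(i):
    """Left bracket: returns only when process i may enter."""
    b[i] = False
    while True:
        if k[0] != i:
            c[i] = True
            if b[k[0]]:      # the currently favored process is idle
                k[0] = i
            continue
        c[i] = False
        if all(c[j] for j in range(N) if j != i):
            return           # no competing claim observed


def xetum(i):
    """Right bracket: release the critical section."""
    c[i] = True
    b[i] = True


counter = 0
inside = 0
max_inside = 0


def worker(i, rounds):
    global counter, inside, max_inside
    for _ in range(rounds):
        mutex(i)
        inside += 1
        max_inside = max(max_inside, inside)
        counter = counter + 1    # the bracketed mailbox manipulation
        inside -= 1
        xetum(i)


threads = [threading.Thread(target=worker, args=(i, 100)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter, max_inside)
```

Only ordinary reads and writes of b, c and k are used; no lock from the host library appears, which is the sense in which coherent mailboxes suffice.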
THE IPC-SETUP


The existence of the elementary communication 
mechanism is conditional, depending upon a number 
of arbitrary postulates and conditions. These 
were introduced in a sequence dictated by the 
orderly development of the subject. These process 
communication prerequisites are the essence of 
this study; I shall therefore re-state them in an
organized fashion. They are subdivided into three
classes: 1) conditions relating to the very nature
and existence of processes, 2) conditions relating 
to the postulated, instrumentality-given mailbox, 
and 3) conditions relating to the processes' 
cooperative behavior. 


First, we have to delimit our consideration to 
processes whose nature makes them capable of 
meaningful mutual communication. Processes which 
wish to communicate belong to a "set of compatible 
processes". The set is defined by the processes 
which communally display all of the properties 
listed below. A process that does not possess all 
of the properties peculiar to a given set does not 
belong in that set, but assuredly belongs in some 
other set. 


Property #1: All N communicating processes must
be sequential (l).


Property #2: All N memory spaces of the communi-
cating processes must overlap (at least) a
single common subset.


Property #3: All N lifespans of the communicating
processes must overlap in Time.


Property #4: The intervals typical of all
communicating processes must be compatible.


Property #5: None of the N processes' termination
times must arrive during the respective process'
critical section.
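

Properties #2 and #3 lend themselves to a mechanical check. The Python sketch below is illustrative only: the tuple layout and the names common_memory and lifespans_overlap are assumptions of the sketch, and lifespans are taken to be open intervals on an arbitrary time axis.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical layout: (memory space, creation time, termination time)
Process = Tuple[FrozenSet[str], float, float]


def common_memory(processes: List[Process]) -> FrozenSet[str]:
    """Property #2: the subset shared by all memory spaces."""
    spaces = [space for space, _, _ in processes]
    return frozenset.intersection(*spaces)


def lifespans_overlap(processes: List[Process]) -> bool:
    """Property #3: some instant lies within every lifespan."""
    latest_creation = max(created for _, created, _ in processes)
    earliest_termination = min(ended for _, _, ended in processes)
    return latest_creation < earliest_termination


a = (frozenset({"mailbox", "x"}), 0.0, 10.0)
b = (frozenset({"mailbox", "y"}), 5.0, 20.0)
late = (frozenset({"mailbox"}), 30.0, 40.0)

print(common_memory([a, b]))
print(lifespans_overlap([a, b]), lifespans_overlap([a, b, late]))
```

The third process shares the mailbox yet fails Property #3, so it falls outside the set of compatible processes formed by the first two.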


Second, we have to postulate the availability 
of an interference-proof mailbox. This requires 
in turn that we postulate the existence of a 
benevolent deus ex machina or "instrumentality"
which has a vested interest in letting the 
processes communicate, and which manifests this 
interest by conveniently providing the required 
mailboxes. 


Postulate #1: There exists an instrumentality 
whose purpose it is to create mailboxes. 


Postulate #2: A mailbox has the natural inherent 
property that its contents can never be in an 
unstable or incoherent state. 


Postulate #3: The mailboxes are accessible to 
all N communicating processes because the 
instrumentality saw to it that they reside in 
the common memory space(s). 


Postulate #4: The instrumentality has thoughtfully 
pre-set all the mailboxes to a non-message 
Vinit state at a point in Time which precedes 
the creation time of any one of the N 
communicating processes. 


(1) This condition applies to the communication 
model developed in this paper. By devising a 
list of different process communication 
prerequisites, a model conducive to non- 
sequential process communication may undoubtedly 
be devised. 


Third and last, the processes' rules of behavior 
must be set up in a manner which will guarantee 
that they always adopt a pattern of harmoniously 
cooperative behavior insofar as communication is 
concerned. For this purpose we conveniently 
postulate another entity, that of the programmer, 
who is responsible for implementing these rules 
of behavior into the logic of those N 
communicating processes. The cooperative behavior 
is made possible by the adoption of certain 
conventions which all N processes agree to respect. 
Such common knowledge of conventions is in itself a 
manifestation of a previously transacted 
communication. As suggested in [11] I name this 
manifestation of pre-natal communication IPC-Setup. 
It originates in the single mind of the single 
programmer (thus, no "chicken or egg" dilemma) 
who incorporates it into the essence of the N 
processes prior to their creation time. The nature 
of the conventions depends on the nature of the 
communication; following is the list of conventions 
required for the existence of our elementary 
communication mechanism: 


IPC-Setup #1: The N communicating processes agree 
on the common name of the single (commonly 
accessible) mailbox to be used. 


IPC-Setup #2: The processes agree to use that 
mailbox for the purpose of communication. 


IPC-Setup #3: The processes agree on their 
respective sender/receiver roles. 


IPC-Setup #4: The N communicating processes 
agree to interpret the value Vinit, with which 
the instrumentality is known to have initialized 
the mailbox, as a non-message implying "no 
message has yet arrived". 


IPC-Setup #5: The receiving processes agree to 
interpret any non-Vinit state of the mailbox as 
implying "a message has arrived". 


IPC-Setup #6: They further agree to assign a 
meaning to any non-Vinit state of the mailbox 
and to interpret that value in some meaningful 
way. 
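

The conventions listed above can be illustrated by 
a minimal sketch in Python. It is not the paper's 
own notation: the names V_INIT, Mailbox, send and 
receive are hypothetical, and the interference-proof 
property of the mailbox (Postulate #2) is simply 
assumed rather than enforced. 

```python
V_INIT = None  # hypothetical stand-in for the non-message value Vinit


class Mailbox:
    """A single shared cell, pre-set to V_INIT before any process exists."""

    def __init__(self):
        self.value = V_INIT


def send(mailbox, message):
    # The agreed sender deposits a value; V_INIT itself can never be a message.
    if message is V_INIT:
        raise ValueError("V_INIT is reserved as the non-message")
    mailbox.value = message


def receive(mailbox):
    # V_INIT means "no message has yet arrived"; any other state is a message.
    value = mailbox.value
    return (value is not V_INIT), value
```

A receiver that calls receive() before the sender 
has acted gets (False, None); after send(mailbox, 1) 
it gets (True, 1), which it must then interpret 
according to IPC-Setup #6. 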


BACK TO PRACTICALITY 


A thesis was presented to the effect that 
organized, deliberate and meaningful communication 
does not spontaneously erupt into being; rather, 
it can always be traced to some pre-existing in- 
stance of preparation and communication. Many 
definitions, decisions and postulates made during 
the development of this paper were admittedly 
arbitrary, and openly acknowledged as such. My 
purpose was not to insist on a certain dogmatic 
point of view (I do not believe that this nebulous 
subject would ever accommodate dogmatism), but 
rather to convey some insight into the complex 
issues that have to be resolved before we can 
safely communicate a single bit of information, 
once only, between processes. 


This study was motivated by the need to 
resolve the "chicken-or-egg" dilemma. It proposes 
a certain hierarchy of causality: the 
interference-proof mailbox, the IPC-Setup, the 
elementary communication mechanism, and lastly the 
mutual exclusion function. Some other such hierarchy 
and its related list of communication prerequisites 
may undoubtedly be developed; I doubt that such a 
list of different prerequisites would be any less 
voluminous than the one proposed. 


The causality (and terminology) developed in 
this paper lend themselves to the description and 
understanding of various IPC mechanisms. To 
illustrate, let me present the workings of the 
asynchronous serial simplex channel connecting a 
sending source to an electro-mechanical printing 
device (e.g., teletypewriter). 


Both sending and receiving processes are 
essentially devoid of buffering memory. The sender 
generates its message, the receiver intercepts it 
and acts on it. The commonly accessible mailbox 
consists of an electrically conducting wire 
connecting both machines. The presence/absence 
of current, or a high/low voltage arrangement, 
represents the two value-states of the mailbox. 
The mailbox is reasonably coherent but is not 
interference-proof; it is said to be susceptible 
to "noise". 


The list of process communication prerequi- 
sites applicable to this example is somewhat 
different from the one developed in this paper. 
In order to make the mailbox capable of 
transmitting two meaningful kinds of messages, namely 
bits zero and one, the mechanism does not support 
the notion of a non-message Vinit. Instead, by 
means of two (instrumentality-provided) synchro- 
nous clocking devices respectively incorporated 
into the two processes, each process is decomposed 
into a continuous sequence of "mini-processes" 
(the reader may wish to re-read observation #1). 
The lifespan of each mini-process is delimited to 
the duration of a single clock tick, and the 
mailbox is reset to the FALSE state at mini- 
process creation time. If a TRUE state is detec- 
ted by the receiving process during its short 
lifespan, it is interpreted to mean "a one-bit 
has been received", otherwise upon its termina- 
tion time a zero-bit message is assumed. A new 
mini-process is created and the same communication 
ritual is re-enacted. 
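

The clocked "mini-process" ritual described above 
can be sketched as a small simulation. This is only 
an illustration under stated assumptions: the wire 
is modelled as a list of sampled TRUE/FALSE states, 
and the function name receive_bits and the parameter 
samples_per_tick are hypothetical, not part of any 
real teletype interface. 

```python
def receive_bits(line_samples, samples_per_tick):
    """Decode one bit per clock tick from sampled wire states (True/False)."""
    bits = []
    for start in range(0, len(line_samples), samples_per_tick):
        tick = line_samples[start:start + samples_per_tick]
        # One mini-process per tick: the mailbox counts as FALSE at creation.
        # Any TRUE observed during the tick means a one-bit was received;
        # otherwise a zero-bit is assumed when the mini-process terminates.
        bits.append(1 if any(tick) else 0)
    return bits
```

With three samples per tick, the sample sequence 
F T F, F F F, T T F decodes to the bits 1, 0, 1. 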


By adding a clocking device and modifying the 
IPC-Setup, we have instilled more usefulness into 
the elementary communication mechanism. Also 
note that the judicious choice of "which process 
do I wish to observe" (i.e., "mini-process" vs. 
the larger "thread") is the key to this function- 
al presentation. 


The elementary communication mechanism is not 
very useful to the programmer. The effort of 
manufacturing functions MUTEX/XETUM, with which 
then to construct a more elaborate communication 
mechanism, is far from negligible. We therefore 
habitually require the availability of some pre- 
fabricated mutual exclusion primitives (such as 
interrupt inhibition) which we then consider, 
from the programming point of view, as elementary. 


IPC mechanisms are typically designed to be 
easily applicable to the kind of processes which 
exist within computer systems. They therefore are 
cognizant of two peculiarities of the computer 
process: 1) the process is typically of a cyclic 
nature (i.e., may be decomposed into a repetitious 
sequence of essentially identical "mini-processes"), 
and 2) the virtual time flow in which the processes 
exist may literally be stopped and started. 


The process's cyclic nature implies that unless 
the correspondent processes are pre-synchronized, 
harmoniously ticking away as does the exemplified 
teletype, a yet un-received message may erroneously 
be overwritten by the next, and the next, etc. 
We typically rule such pre-synchronization out 
because asynchronous processes can normally be put 
to better use. Instead, we implement a "pipeline" 
capability into even the binary mailbox, trading 
off inherent synchronization vs. inherent buff- 
ering effect. Such a buffer, or list of one-bit 
messages, is trivially implemented in the form of 
a binary counter. Assuming the availability of 
MUTEX/XETUM, the sending operation is now: 


    MUTEX(mailbox); 
    mailbox := mailbox + 1; 
    XETUM(mailbox); 


and assuming that the zero state implies "mailbox 
is empty", the receiving operation is 


busyloop: 
    MUTEX(mailbox); 
    IF mailbox = 0 THEN 
    BEGIN 
        XETUM(mailbox); 
        GOTO busyloop; 
    END; 
    mailbox := mailbox - 1; 
    XETUM(mailbox); 
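

The two operations above can be restated as a 
runnable sketch in Python. It assumes that a 
threading.Lock is an acceptable stand-in for the 
MUTEX/XETUM pair; the class name CounterMailbox is 
hypothetical. 

```python
import threading


class CounterMailbox:
    """A counting mailbox: zero means "mailbox is empty"."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for MUTEX/XETUM
        self._count = 0

    def send(self):
        with self._lock:                # MUTEX(mailbox) ... XETUM(mailbox)
            self._count += 1

    def receive(self):
        while True:                     # the busy loop
            with self._lock:
                if self._count > 0:
                    self._count -= 1
                    return
            # The lock is released here, then the test is repeated.
```

The receiver spins, acquiring and releasing the lock, 
until at least one message has been counted in. 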


Virtual processors are artificial constructs 
derived from some real life hardware CPU resource. 
In a system with N virtual processors, any 
non-productive activity of one is to the detriment of 
all others, wastefully misusing a finite CPU 
resource which could be put to some good 
productive use elsewhere. Our busyloop is archetypal 
of such wasteful behavior. 


It is therefore economically desirable to 
include in the IPC mechanisms which are put at 
the programmer's disposal a provision by which 
a waiting process not only suspends its logical 
progression, but literally causes its virtual 
time flow to stop. Once stopped, the process is 
said to be "dormant" and can no longer insistently 
test the mailbox for the awaited message. It is 
the cooperative sending process which, after 
having deposited its message, helpfully "nudges" 
the dormant process back into wakefulness. 
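

A sketch of the dormant-receiver arrangement, again 
in Python and under assumptions: threading.Condition 
is used here as one possible way to stop and restart 
a waiting thread, and the class name SleepingMailbox 
is hypothetical. 

```python
import threading


class SleepingMailbox:
    """A counting mailbox whose receiver sleeps instead of busy-looping."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0

    def send(self):
        with self._cond:
            self._count += 1
            self._cond.notify()     # "nudge" one dormant receiver awake

    def receive(self):
        with self._cond:
            while self._count == 0:
                self._cond.wait()   # dormant: lock released, no polling
            self._count -= 1
```

The waiting side no longer tests the mailbox 
repeatedly; it re-examines the count only after the 
sender has signalled. 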
Summary 


In 1972, the Digital Equipment Corporation sponsored 
a limited-objective research project to investigate 
the properties of the new kernel/domain systems 
architecture, whose theoretical model was earlier 
developed by Spier [1]. A companion paper [2] reports 
on that project. The domain is a monitor (or 
supervisor, executive)-like local independent address 
space which may be mapped over a collection of (mostly) 
exclusive memory space partitions to provide a 
protected run-time environment. Similar to the 
classical monitor, control may be transferred into the 
domain through pre-designated inter-domain entry points 
named gates [1]. In a domain system, supervisory code 
no longer resides in a single monolithic monitor, but 
is distributed among a number of supervisory domains; 
of these, the most central and most critical 
supervisory domain is named kernel [1] [2] [3]. The 
kernel is responsible for basic resource management 
only and is by definition devoid of any decision 
making code. 

If we view the term process as meaning the activity 
of a processor within a memory space [4], then the 
execution of a processor within a domain (read, 
exclusive memory space) is an independent process. In a 
domain system where a single user computation may cause 
the activation of many domains, that computation's 
sequentiality may be viewed as the sequential 
activation of many processes. For the sake of 
conformity, we chose to apply the term process to the 
larger sequentiality, and coined the term 
domain-incarnation [2] to designate the execution of a 
single domain by a single processor. The transfer of 
control from one domain to another, although 
synchronous and sequential, displays some of the 
properties inherent to interprocess communication (IPC) 
mechanisms [4]. Our kernel-implemented comprehensive 
inter-module communication mechanism handled the 
following cases: 

1. The explicit sequential activation of a procedure 
entry point, expressed in the form 
CALL procedure(argument-list); 

2. The implicit sequential activation of a procedure 
entry point, currently known to be the handler 
for some predesignated condition, expressed in 
the form SIGNAL condition(argument-list); 

3. The explicit non-sequential activation of a 
procedure entry point by some other process, 
expressed in the form 
INTERRUPT process, procedure(argument-list); 

4. The implicit non-sequential activation of a 
procedure entry point by some other process, where the 
procedure is currently known to be the handler for 
some predesignated event, expressed in the form 
NOTIFY event(argument-list); 
Notice that the event declaration always included the 
declaration of the currently handling process(es), so 
that the process identity did not have to be explicitly 
mentioned within the NOTIFY sequence. 

5. The abnormal cancellation of a sequence of calls 
through a non-local GOTO to a predesignated entry 
point declared to be a handler for the 
unwind condition, expressed in the form UNWIND; 

(a) This paper reports on a pure-research project, 
and may not be construed to imply any product 
commitment by the Digital Equipment Corporation. 

Note that both conditions and events came in two 
flavors: 1) LOCAL to remain in effect only as long 
as the procedure activation that declared them, and 
to automatically be terminated upon RETURN from that 
procedure activation, and 2) GLOBAL to indefinitely 
remain in effect until explicitly terminated. 

Thus, all the above inter-module communication 
functions were kernel-managed and invariably re- 
sulted in the argument-carrying formal activation 
of a procedure entry point, to uniformly be 
dismissed via a formal RETURN. Both of our 
inter-process communication functions were a software 
simulation of the classical hardware interrupt. 

Given our predominant concern to keep the kernel 
application independent, the software interrupt 
facility seemed to be the most general mechanism 
conceivable. A special kernel-call of the form 
SLEEP(time-limit); would put the calling process into 
a dormant state to be re-awakened when either 1) an 
INTERRUPT or NOTIFY signal is received, or 2) the 
time-limit has expired, whichever happened first. 
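
The "whichever happened first" behavior of such a 
sleep call can be sketched in Python. This is not the 
kernel interface of the project: the class name 
Process and the methods sleep and notify are 
hypothetical, and threading.Event is assumed as the 
wake-up signal. 

```python
import threading


class Process:
    """A process that can sleep until signalled or until a time limit."""

    def __init__(self):
        self._wake = threading.Event()

    def sleep(self, time_limit):
        # Returns True if a signal arrived, False if the time limit expired.
        signalled = self._wake.wait(timeout=time_limit)
        self._wake.clear()   # a signal arriving in this gap would be lost
        return signalled

    def notify(self):
        # Called by another process to wake the sleeper.
        self._wake.set()
```

A signal sent before the sleep begins is remembered, 
so the subsequent sleep returns immediately; the 
comment in sleep marks the one window this simple 
sketch does not protect. 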

The asynchronous nature of INTERRUPT and NOTIFY 
implied a certain minimal argument-buffering facility 
within the kernel. Also, the activation of a procedure 
entry point by either of the asynchronous invocations 
caused all further asynchronous signalling to that 
same process to be inhibited, until a return was made. 
We had additional, more specific kernel-calls to more 
finely control the inhibition/reactivation of 
inter-process signals, as well as mutual exclusion 
functions MUTEX/XETUM [4] which were also 
kernel-implemented. 

Finally, note that our choice of the software- 
interrupt facility did not preclude the availability 
of more classical IPC interfaces, such as 
MSG := WAIT(mailbox); and NOTIFY(mailbox, MSG); [5]. 
Such mechanisms could be implemented within dedicated 
supervisory domains by means of the tools just 
described. One of the reasons for choosing these more 
general tools was to provide the ability for virtual 
user computations to be multiprogrammed. 
Summary 


The construction of sorting networks 
has been a topic of much recent discussion 
[1] - [5]. In view of the apparent 
difficulty of verifying whether a reasonably 
large proposed sorting network actually 
does sort, the most useful approach for 
constructing large networks seems to be to 
devise a recursive scheme which constructs 
a network which is guaranteed to sort, ob- 
viating the verification phase. Examples 
of this approach are presented in [1],[5]. 
In this note, another such approach is 
presented. 


The most economical 16-line sorter 
known has been constructed by Green [3], 
[4]. His approach is to successively sort 
lines whose indices differ in one compon- 
ent of the binary expansion. This yields 
a partial ordering of the lines which is 
isomorphic to a Boolean "n-cube" configu- 
ration. This configuration is then further 
sorted to yield a linear order. The net- 
work for accomplishing this is constructed 
in a clever, but ad hoc manner, and no 
techniques for extending this approach to 
larger numbers of lines have appeared. 


In this note such a technique is pre- 
sented. However, it suffers from the fact 
that it produces networks which are no more 
economical than the odd-even merge networks 
of Batcher [1]. Nevertheless, some in- 
sight may result from a knowledge of this 
technique. 
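

For reference, the comparison count of Batcher's 
odd-even merge networks, and the cost of 
verifying a network by brute force, can be 
illustrated with a short Python sketch. It 
generates the standard recursive odd-even merge 
sort (not the construction of this note) for a 
power-of-two number of lines and checks it with 
the zero-one principle. All function names are 
hypothetical. 

```python
from itertools import product


def oddeven_merge(lo, hi, r):
    """Comparators merging two sorted halves of lines lo..hi (inclusive)."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def oddeven_merge_sort(lo, hi):
    """Comparators of odd-even merge sort on lines lo..hi (inclusive).
    The number of lines must be a power of two."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def sorts_all_inputs(network, n):
    """Zero-one principle: test all 2**n inputs of zeros and ones."""
    for bits in product((0, 1), repeat=n):
        lines = list(bits)
        for i, j in network:
            if lines[i] > lines[j]:
                lines[i], lines[j] = lines[j], lines[i]
        if any(lines[k] > lines[k + 1] for k in range(n - 1)):
            return False
    return True


network8 = list(oddeven_merge_sort(0, 7))
```

The generator yields 19 comparators for 8 lines 
and 63 for 16 lines. The exhaustive check doubles 
in cost with every added line, which is the 
verification difficulty that recursive, 
provably correct constructions avoid. 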


The approach is to reduce an n-cube 
configuration to an (n-m)-cube in which the 
vertices represent linear orders of m 
components. A recursive rule is given which 
applies this technique to obtain a complete 
sorting network, and the correctness of the 
rule is proved. It is then shown that the 
number of comparisons for an n-line network 
is the same as in Batcher's construction, 
although the networks are definitely not 
isomorphic to Batcher's. For certain 
numbers of lines, this method yields 
networks which are related to Batcher's by a 
kind of "flipping" operation described in 
[2]. Precisely what relation holds 
between these two constructions has not yet 
been discovered. 


A complete presentation of these 
results appears in [6]. The construction 
is derived for the more general k-ary n- 
cube, but upper bounds are only shown for 
k=2 (the "Boolean" case). Whether other 
values of k yield better results has not 
been thoroughly investigated. Proofs of 
correctness are done in terms of partial 
orders, using a useful and general lemma 
about "cross products" of partial orders 
and the technique of Liu [7]. 
Abstract -- Various 2-dimensional iterative 
arrays for the combined parallel implementation 
of signed binary multiplication and division are 
presented. Speed and cost comparisons are made 
with both commercial arithmetic units and recent 
design and prototype studies. It is shown that 
combined function arrays can be both speed and 
cost competitive with separate function arrays. 


Introduction 


Large, iteratively structured, combinational 
networks for all four basic arithmetic functions 
(Add, Subtract, Multiply, and Divide) have become 
a practical reality in high-speed, general- 
purpose scientific computers [1],[2] and special 
purpose applications [3],[4]. Recent design and 
prototype studies [5],[6],[7] on feasible varia- 
tions have also been reported. 


The parallel processing speed of the sub- 
system units for each arithmetic function has been 
enhanced from a system throughput viewpoint by 
employing both duplicated units and pipelining 
[1],[2]. On a uniprocessor system, the effective- 
ness of these latter system designs depends to a 
large extent on program and instruction mix as 
well as depth of instruction lookahead. 


In most of the references cited above, there 
is a tendency towards optimizing a large combina- 
tional subsystem unit for each arithmetic func- 
tion. Duplicating or pipelining these separate 
function units then achieves the desired system 
speed. A commercial exception is [1] in which a 
particular unit performs multiply or divide under 
appropriate conditioning and sequencing. Also, 
the design studies of [8] and [9] investigated a 
planar logic array that combines the same two 
functions. 


The purpose of this paper is to present new 
combined Multiplier/Divider (MD) iterative arrays 
and analyze their effectiveness as compared to 
current alternatives. The MD arrays are 2- 
dimensional and accept two, signed, binary 
operands in 2's-complement notation along with a 
binary signal to denote M or D. A double-length 
product, or quotient and remainder, are generated 
after a specified delay. The basic approach is 
to start with a simple (but relatively slow) 
configuration, called MD1, that is similar in 
complexity to [8] and [9]. Design changes to 
increase speed are then incorporated in 3 
successive steps that result in the MD4 array, which 
is comparable in speed to the fast individual 
function arrays of [6] and [7], while costing 
much less than the sum of the 
costs of the individual function arrays. 


The two basic parameters that are used for 
comparisons throughout the study are logic delay 
and cost. Delay is expressed in terms of a nor- 
malized value t that represents delay through a 
functional level (AND-OR, NAND-NAND, NOR-NOR, 
etc.) under reasonable fan-in constraints on all 
gates. Processing rates based on pipelining are 
covered elsewhere [10]. Two different cost 
criteria are considered: gate cost, obtained by 
counting individual gates, and integrated circuit 
count under reasonable assumptions on MSI level 
circuits. Both of these methods are 
justified in terms of currently available 
integrated circuits. 


The Basic Comparison Parameters 
Logic Delay 


As stated above, all delay expressions will 
be stated in terms of a normalized value t that 
represents delay through a functional level. The 
choice of a delay unit such as t is not a 
straightforward one. Hopefully, the reason for 
choosing a delay unit in any logic design is to 
arrive at as simple and as accurate a measure as 
possible of the delay through an implementation 
of the design in some particular logic circuit 
family. This is achieved by substituting a 
typical value of time (say 12-16 nanoseconds in 
some TTL technologies) for t in the delay express- 
ion. Now consider where this technique causes 
problems. In arithmetic arrays, the full adder 
(FA) function (three inputs, sum (S) and carry (C) 
outputs) and the exclusive-or (EX-OR) function 
usually account for a large part of the logic 
design components. If we assign t on a functional 
level, as indicated above, all three outputs, S, 
C, and EX-OR, will occur with delay t after inputs 
are available. (It should be noted that we ignore 
any input inversions needed in both delay and cost 
computations.) But, in many TTL integrated cir- 
cuits, the delays in producing these three func- 
tions can be appreciably different. For instance, 
EX-OR might be 1-2 times the delay of a single 
NAND gate, and C is typically substantially faster 
than S. This presents a fundamental problem in 
attempting a general delay measure that is useful 
in comparing various logic designs to gauge their 
implementation effectiveness. Our compromise is 
to use delay expressions involving t as defined 
above. We then claim that, although they might 
not be accurate enough to compute absolute delays 
achievable in implementing various designs (based 
on some average t for a certain logic circuit 
family), they suffice for our purposes of getting 
some quantitative figure of merit for comparisons. 
In fact, we depend on the S, C, and EX-OR type of 
discrepancies being averaged out along the longest 
delay path in the various designs. 


Logic Cost 


The technology used (wired-or capability, 
etc.) and level of integration (SSI, MSI, etc.) 
assumed complicate the definition of a suitable 
logic cost measure probably to a greater degree 
than they affect the adoption of a simple delay 
measure, as discussed above. In this paper, we 
will base logic cost on one of two distinct 
measures. The simplest and most often used 
measure will be total gate count. Since we are 
discussing relatively large combinational arrays 
of logic circuits, where fan-in ranges normally 
from 2 through 4, we will not explicitly include 
inputs in our gate cost measure. Implicitly, of 
course, the basic gate cost unit, g, can be taken 
to mean the cost of some "average" gate which 
"on the average" might have 3 inputs. Another 
cost measure that we will use in one instance is 
that of integrated circuit count under some 
reasonable current technology complexity level. 
This technique will be given in more detail later 
when it is applied. 


Other Possible Parameters 


Other design parameters that might illuminate 
the comparative merits among various logic designs 
are possible. Interconnection crossover 
complexity, array cell regularity, and standard 
function utilization are among these. We will not work 
out the details on any other than delay and cost 
as defined above; however, we present logic 
diagrams, for all four arrays discussed, in enough 
detail that anyone can derive particular figures 
of merit that might be of interest. 


Four Multiplier/Divider (MD) Arrays 


The most familiar binary multiplication 
algorithm is to shift the multiplicand (B) left 
once for each multiplier (R) bit position, after 
adding B into an accumulating partial product 
(A) if the corresponding R bit is 1, until A 
finally becomes the product P = B·R when all 
multiplier bits, low order to high order, have 
been used. This scheme has been stated for 
positive operands; but, by modification due to Booth 
[11], it can be made to work for signed operands 
in 2's complement representation, yielding P 
directly in the correct 2's complement form. 
Subtraction of the multiplicand, as well as addi- 
tion, is possible. Each operation decision is 
the result of inspecting the appropriate multip- 
lier bit and its right-hand neighbour at each 
step. 
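

The bit-pair inspection rule of the Booth 
modification can be illustrated by a short 
sequential sketch in Python. It models the 
arithmetic only, not the array hardware; the 
function name booth_multiply and its parameters 
are hypothetical, and operands are taken as 
n-bit 2's complement integers rather than the 
fractions used later in the paper. 

```python
def booth_multiply(b, r, n):
    """Multiply n-bit two's-complement integers b (multiplicand) and r."""
    low, high = -(1 << (n - 1)), (1 << (n - 1))
    if not (low <= b < high and low <= r < high):
        raise ValueError("operands must fit in n bits")
    product = 0
    right_neighbour = 0          # implicit 0 to the right of the low-order bit
    for i in range(n):           # low order to high order
        bit = (r >> i) & 1       # bit i of r in two's complement
        if bit == 0 and right_neighbour == 1:
            product += b << i    # pair (0, 1): add the shifted multiplicand
        elif bit == 1 and right_neighbour == 0:
            product -= b << i    # pair (1, 0): subtract it
        right_neighbour = bit    # pairs (0, 0) and (1, 1): no operation
    return product
```

No separate sign correction is needed: the 
subtraction triggered at the sign bit of a negative 
multiplier supplies its negative weight. 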


One of the standard division algorithms is 
the non-restoring algorithm operating on a 
dividend A with a divisor B to generate a quotient Q. 
The altered dividend is referred to as the partial 
remainder at each step. An operation cycle is as 
follows: The sign of A is inspected. If it is 
positive, B is subtracted from A and if it is 
negative, B is added to A. The quotient bit 
generated is the complement of the sign bit of the 
new partial remainder. The divisor is shifted one 
binary position right after each cycle. Many 
authors have discussed this scheme; see, for 
example, Guild [12]. This algorithm can also be 
modified to operate on signed operands; however, 
the quotient generated is correct only if it is 
positive; if it is negative it is in 1's 
complement, so a one must be added later to convert 
it to 2's complement notation. Separate planar 
arrays of cells, each usually containing a full 
adder with controlled inputs, can be constructed 
fairly directly from these or similar algorithms. 
For instance, see Majithia and Kitai [13], 
Bandyopadhyay, et al [14], Deegan [15], and 
Hoffman, et al [16] for array implementations of 
multiplication based on variations of the above 
basic scheme. Division array implementations 
based on variations of the above discussion appear 
in Guild [12], Dean [17], Gardiner [18], and 
Gardiner and Hont [19]. 
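

The operation cycle of the non-restoring 
algorithm can be sketched sequentially in Python 
for positive operands. Several points are 
assumptions of this sketch rather than statements 
of the paper: operands are integers instead of 
normalized fractions, the function name 
nonrestoring_divide is hypothetical, and a final 
correction step is added so that the returned 
remainder is non-negative. 

```python
def nonrestoring_divide(a, b, n):
    """Divide non-negative a by positive b, producing n quotient bits.
    Requires a < b * 2**n so that the quotient fits in n bits."""
    if b <= 0 or a < 0 or a >= (b << n):
        raise ValueError("operands out of range for an n-bit quotient")
    remainder = a
    quotient = 0
    for i in range(n - 1, -1, -1):
        shifted = b << i                  # divisor moves one place right per cycle
        if remainder >= 0:
            remainder -= shifted          # positive partial remainder: subtract
        else:
            remainder += shifted          # negative partial remainder: add
        bit = 1 if remainder >= 0 else 0  # complement of the new sign bit
        quotient = (quotient << 1) | bit
    if remainder < 0:                     # added correction, not in the cycle above
        remainder += b
    return quotient, remainder
```

For example, dividing 7 by 2 with n = 3 passes 
through the partial remainders -1, 3 and 1 and 
yields quotient bits 0, 1, 1. 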


MD1 Array 


When we attempt to combine the separate 
arrays, the only sensible arrangement seems to be 
to associate the B vector (multiplicand or divi- 
sor) positions with each other and the A vector 
(partial product or dividend and partial remain- 
der) positions with each other, moving downward 
through the rows of the array. That is why we 
have combined their names. The multiplier, R, 
and quotient, Q, are positioned at the left 
column edge of the array. There is one complica- 
tion. In multiplication, B is shifted left with 
respect to A; but in division, B is shifted right 
with respect to A. The solution is to shift B 
right with respect to A in multiplication, and 
inspect and use the multiplier bits high order to 
low order instead of in the other direction as 
above. This is the scheme developed by Majithia 
and Kitai [13]. The arrays can then be combined 
as MD1 in Figure 1. In general, the B, Q and R 
vectors are n bits long, including the sign bit 
in the case of B and R. The A vectors are 2n-1 
bits long, including the sign bit. It is con- 
venient to consider the operands in fraction form, 
with A and B normalized in the case of division. 
We then have, (where all coefficients = 0 or 1): 


B (Multiplicand or Divisor) = B_0.B_1 ... B_{n-1} 
  = -B_0 2^0 + B_1 2^{-1} + ... + B_{n-1} 2^{-(n-1)} 

R (Multiplier) = R_0.R_1 ... R_{n-1} 
  = -R_0 2^0 + R_1 2^{-1} + ... + R_{n-1} 2^{-(n-1)} 

Q (Quotient) = Q_0.Q_1 ... Q_{n-1} 
  = Q_0 2^0 + Q_1 2^{-1} + ... + Q_{n-1} 2^{-(n-1)} 

A (Partial Product or Remainder) = A_0.A_1 ... A_{2n-2} 
  = -A_0 2^0 + A_1 2^{-1} + ... + A_{2n-2} 2^{-(2n-2)} 
In the case of division, Q_0 is not the sign bit, 
but is a significant bit of the answer. This is 
because 1/2 < Q < 2 for 1/2 < A,D < 1. The sign 
bit for Q is thus not indicated in our arrays. 
The output of the M function in cell 1 of MD1 
(Figure 1(b)) is given by M = D,D2By + D,D2Bj. 
The "function" signal, F, is set to 0 for 
multiplication and 1 for division; cell 2 (Figure 1(c)) 
then routes the multiplier bit pairs R_x, R_{x+1} for 
Booth algorithm control in multiplication, or 
routes the sign bit control for division. Note 
that all R bits must be set to 0 when division 
is being performed. Thus, cell 2 acts as a 
control column of cells, and the M function in cell 1 
uses the control signals to appropriately apply 
the correct version of the B vector to the A 
vector. The cost and delay expressions are: 

MD1 cost = (18n^2 + 2n)g (1a) 
MD1 mult. delay = (2n + 1)t (1b) 
MD1 div. delay = (n^2 + 2n)t (1c) 
The combined multiplier/divider arrays of Gex [8] 
and Gardiner and Hont [9] are similar in 
complexity of design and have about the same cost and 
delay properties. 


MD2 Array 


Our procedure is now to introduce 
substantial design changes in three successive 
steps starting from MD1. They are substantial 
in that the basic algorithms for carrying out the 
arithmetic operations are altered significantly. 
In MD2, the partial product/remainder vector A is 
not developed explicitly at each row level but is 
represented by two binary vectors S and C, which, 
if added, would produce the correct vector A at 
that row level. This is the familiar carry-save 
reduction technique that was originally introduced 
by Wallace [20] in a 3-dimensional multiplier 
logic design. The two vectors S and C are the 
result of a 3-to-2 carry-save reduction on the 
previous row's S and C vectors and the proper 
version of the B vector. In the case of 
multiplication, this necessitates a length 2n-1 fast 
adder operating on the S and C outputs of the last 
row to produce the product P. This is indicated in 
Figure 2(a). Since the division process requires 
the sign of A to determine the subsequent row 
operation, this must be determined by a carry 
lookahead network L at the left end of each row. 
It operates on generate and propagate functions 
formed in the type 3 cells. These G_j and P_j 
functions are formed from the S and C vector 
outputs of each row. An examination of the 
non-restoring division algorithm reveals that the 
carry-out from the sign bit position directly 
yields the quotient bit, so this is the way it is 
done in Figure 2. This observation actually 
constitutes a suggested improvement to the design 
in [6]. Q_i is then fed to the control cell 5 of 
the next row. The L cell must be redesigned for 
each operand length, and if fan-in is constrained 
to be equal to or less than eight, then a two-level 
(2t) lookahead scheme must be employed for 
n ≥ 10. This is reflected in the cost and delay 
figures shown below. The cost and delay 
expressions are: 

MD2 cost = (21n^2 - n)g for n < 10 (2a) 
         = (21n^2 + 2n√n)g for 10 ≤ n ≤ 64 

MD2 mult. delay = (n + 2)t (2b) 

MD2 div. delay = (6n)t for n < 10 (2c) 
               = (7n)t for 10 ≤ n ≤ 64 


The cost increase from MD1 to MD2 is small com- 
pared to the speed gain, especially in the case of 
division, which has been made essentially linear 
in n over practical operand ranges. This form of 
array division algorithm is due to Cappa and 
Hamacher [6] and the carry-save technique (along 
with multiplier bit grouping) has been used by 
Ramamoorthy and Economides [7] in a high-speed 
planar multiplier array. It is to be noted that 
the cost and delay of the Fast Adder have not been 
included in the above expressions. It can be 
designed (with carry lookahead techniques) so that 
it does not change any of the expressions by more 
than about 20%. To our knowledge, MD2 and the 
next two arrays have not appeared in the litera- 
ture. 
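

The 3-to-2 carry-save reduction used in each 
row of MD2 can be illustrated at the word level 
by a small Python sketch. It treats the three 
input vectors as non-negative integers and is not 
a model of the cell wiring; the function name 
carry_save_3to2 is hypothetical. 

```python
def carry_save_3to2(x, y, z):
    """Reduce three non-negative integers to (S, C) with S + C == x + y + z."""
    s = x ^ y ^ z                            # per-bit sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1   # per-bit carries, moved one place left
    return s, c
```

No carry propagates along the word in this step, 
which is why the row delay is independent of the 
operand length; only the final fast adder (or, 
for division, the lookahead network on the sign) 
must resolve S and C into a single vector. 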


MD3 Array 


The next change to make is to decode the 
multiplier bits in pairs and generate two quotient 
bits at a time. Although this increases the com- 
plexity of each row of cells in the array, the 
number of rows is reduced by a factor of two. A 
net cost saving then results. We get MD3, as 
shown in Figure 3, by making these two changes to 
the MD1 structure. When the MD2 techniques of 
carry-save reduction and carry-lookahead are also 
incorporated into MD3 we will finally have evolved 
to MD4 which is in the next subsection. The mul- 
tiplier bit grouping technique is well known and 
has been used by Wallace [20] and Ramamoorthy and 
Economides [7] in their arrays so it will not be 
detailed here. The technique for generating two 
quotient bits at a time is somewhat more complex 
but has also been adequately described in detail 
by Flores [21]. It necessitates having 3/2 the 
divisor available as an input vector. We assume 
that this is formed before division is begun and 
is presented as one of the inputs. The r, s, t 
bundle of inputs into cell 6 (Figure 3(b)) is 
really a bit position of 1/2 B, B, and 3/2 B in 
the case of division; and in the case of multip- 
lication, r and s represent one bit position of 
1/2 B and B, respectively, with t not being used. 
The control signals T_1 and T_2, which are outputs
from control cell 7 (Figure 3(d)), are used to
select appropriately among r, s, t or their
complements. This selection is done in the E logic
(Figure 3(c)) of cell 6 (Figure 3(b)). The D
signal decides whether or not to complement. An
inspection of the wiring of cell 6 should convince
the reader that the 2-place shift per row of B is
performed correctly. The signals T_1, T_2 and D are
determined by a multiplier bit pair and its
adjacent bit neighbour on the right in the case of
multiplication, and by the leading three bits of
A and two bits of B in the case of division.
This is all accomplished in control cell 7
(Figure 3(d)), along with the generation of two
quotient bits in the case of division. The cost
and delay expressions are:

MD3 cost = $(13n^2 + 37n + 25)\,g$   (3a)
MD3 mult. delay = $(2n + 3)\,t$   (3b)
MD3 div. delay = $(n^2/2 + 2n + 3)\,t$   (3c)


There are actually small variations in these
expressions depending on whether n is even or odd,
but in each situation we have given the worst
case. Compared to MD1, in MD3 the cost is
appreciably lower, multiplication time is about
the same, and division time has been halved. MD3
is slower than MD2, but costs less. The final
evolution to MD4, which incorporates the MD2
techniques of carry-save reduction and carry
lookahead, will prove to be the best design on all
counts.

It should again be noted before we leave this 
section that the time and cost involved in genera- 
ting 3/2 B has been neglected. For practical 
values of n, this is a reasonable assumption. 
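
The decoding of a multiplier bit pair together with its right-hand neighbour can be illustrated in software. The Python sketch below uses the standard radix-4 (modified Booth) recoding, which we assume corresponds to the pairing used in MD3. It selects among the multiples B and 2B rather than the 1/2 B and B inputs of cell 6, which differs only by a fixed scaling of the result. The function names are hypothetical.

```python
def booth_pair_digits(multiplier, n):
    """Recode an n-bit two's-complement multiplier (n even) into radix-4 digits.

    Returns the digits least significant first, each in {-2, -1, 0, 1, 2}.
    """
    if n % 2:
        raise ValueError("n must be even for bit pairing")
    bits = multiplier & ((1 << n) - 1)
    digits = []
    right = 0                          # neighbour to the right of bit 0
    for k in range(0, n, 2):
        low = (bits >> k) & 1
        high = (bits >> (k + 1)) & 1
        digits.append(low + right - 2 * high)
        right = high                   # becomes the neighbour of the next pair
    return digits


def multiply_by_pairs(a, b, n):
    """Multiply the n-bit two's-complement multiplier a by b, one bit pair per row."""
    product = 0
    for row, digit in enumerate(booth_pair_digits(a, n)):
        product += (digit * b) << (2 * row)   # 2-place shift per row
    return product
```

With n = 4, the multiplier 6 (binary 0110) is recoded as the digits [-2, 2], that is, 6 = -2 + 2 x 4, so only n/2 rows of add/subtract operations are needed.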


MD4 Array 


If the carry-save and carry lookahead techniques
described in the MD2 subsection are applied
to the MD3 structure, we obtain the MD4 array
shown in Figure 4. Since the control cell 7 is
the same as in Figure 3(d), no further discussion
of it is needed. Also, the E function in the main
body cells 8 and 9 (Figures 4(b) and (c)) is the
same as in Figure 3(c). The remainder of cells 8
and 9 is much the same as in cell 6 (Figure 3(b))
of MD3, the differences being that S and C vectors
are produced to represent A, and P and G functions
are included to provide inputs to the lookahead
computation. The L cell of Figure 4(a) is similar
to the L cell of Figure 2 and is used in MD4 to
produce the carry-in to the A_2 position. This
carry and the S and C vector bits for partial
remainder positions A_0, A_1, and A_2 are inputs to
the CL cell (Figure 4(d)). The CL cell computes
A_0, A_1, and A_2, which are needed in the control
cell 7 for the division process. The cost and
delay expressions are:


MD4 cost = $(15n^2 + 47n + 33)\,g$ for $7 \le n \le 13$
         = $(15n^2 + 47n + 33 + (n+1)\sqrt{n-4}/2)\,g$ for $14 \le n \le 68$   (4a)

MD4 mult. delay = $(n/2 + 4)\,t$   (4b)

MD4 div. delay = $(3n + 5)\,t$ for $7 \le n \le 13$
               = $(4n + 6)\,t$ for $14 \le n \le 68$   (4c)


Again, as in MD2, the Full Adder has been omitted
from both cost and delay expressions, as well as
the formation of 3/2 B.

If we substitute the practical range of
values n = 8, 16, 32, and 64 into equation sets 
(1), (2), (3), and (4), we obtain Table I, which 
allows convenient comparisons among the MD arrays. 
It is easy to conclude that MD4 is the best design 
from the cost/delay effectiveness standpoint. The 
rest of this section will be devoted to comparing 
MD4 to members of two classes of multipliers and 
dividers. 
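
The substitution of n = 8, 16, 32, and 64 into the delay expressions is easy to reproduce. The Python sketch below evaluates the division delays (2c), (3c), and (4c) and the MD4 multiplication delay (4b), in units of t. The function names are ours, and the range boundaries (n < 10 for MD2, n <= 13 for MD4) are our reading of the expressions, so they should be treated as assumptions.

```python
def md2_div_delay(n):
    """Expression (2c): single-level lookahead below n = 10, two-level from there."""
    return 6 * n if n < 10 else 7 * n


def md3_div_delay(n):
    """Expression (3c), evaluated for even n."""
    return n * n // 2 + 2 * n + 3


def md4_div_delay(n):
    """Expression (4c); the boundary between n = 13 and n = 14 is an assumed reading."""
    return 3 * n + 5 if n <= 13 else 4 * n + 6


def md4_mult_delay(n, adder_allowance=0):
    """Expression (4b) for even n, optionally adding an allowance for the final adder."""
    return n // 2 + 4 + adder_allowance


if __name__ == "__main__":
    print(f"{'n':>3} {'MD2 div':>8} {'MD3 div':>8} {'MD4 div':>8}")
    for n in (8, 16, 32, 64):
        print(f"{n:>3} {md2_div_delay(n):>8} {md3_div_delay(n):>8} {md4_div_delay(n):>8}")
```

At n = 16 with a 5t allowance for the adder, md4_mult_delay(16, 5) gives 17, which agrees with the 17t figure used for MD4 in the comparison with the AMD array.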


Other Logic and Prototype Designs 


In this subsection, MD4 is compared to two 
high-speed planar separate function arrays that 
have been reported. The multiplier array (RE) of 
Ramamoorthy and Economides [7], that uses bit 
grouping of the multiplier and carry-save reduct- 
ion as in MD4, has approximate cost and delay 
expressions as follows: 


RE array cost = $(10n^2 + 8n + 26)\,g$   (5a)

RE array mult. delay = $(n/2 + 2)\,t$   (5b)


The division array (CH) of Cappa and Hamacher [6]
that uses carry-save reduction and carry lookahead
but generates only one quotient bit per row as in
MD2, has approximate cost and delay expressions
as follows:

CH array cost = $(17n^2 + 10n)\,g$ for $n < 10$
              = $(17n^2 + 11n + 2n\sqrt{n})\,g$ for $10 \le n \le 65$   (6a)

CH array div. delay = $(4n)\,t$ for $n < 10$
                    = $(5n)\,t$ for $10 \le n \le 65$   (6b)


Table II allows a concise comparison of the RE, 
CH, and MD4 arrays. 


Commercial Structures 


The Advanced Micro Devices (AMD) Co. [22] 
produces a 2 bit x 4 bit 24-pin MSI multiplier 
chip (the AM2505) and a 4 bit 24-pin MSI adder 
chip (the AM9340) that can be used as the basic 
cells in a multiplication array. They use bit 
grouping of the multiplier, do not use carry-save 
reduction, but use a carry lookahead scheme for 
fast propagation of the carries along each row of 
AM2505's. The AM9340's are used in parallel to 
accumulate a set of partial products into the full 
product. At an operand length of n = 16, the
delay is approximately 30t as compared to about
17t for MD4. The AMD array delay value is derived
by evaluating the logic equivalent of their chip
in a manner consistent with our evaluation of the
MD, RE, and CH arrays. Since the AMD array
generates the product P, we have included in the
17t MD4 delay a plausible amount (5t) for the
32-bit lookahead adder needed in conjunction with
the basic MD4 structure.


There are 32 AM2505 chips and 16 AM9340 chips
needed at n = 16. Now if we consider that 4 of
the main body cells (8 and 9) in the MD4 are
implemented in a single 40-pin MSI chip, the MD4
would require about 40 of these chips plus the
final full adder (8 AM9340 type of chips) and the
left column of control logic (cells 7, CL and L).
If we estimate this control logic at about the
equivalent of 24 MSI chips, the total MD4 array
has an MSI chip count of about 72, so that it
would be about 50% more expensive than the 48-chip
AMD array. It is also instructive to estimate the
equivalent gate count in the AMD array as compared
to a gate count for the MD4, which can be derived
from expression (4a) above plus a reasonable full
adder gate count. If we do this for the n = 16
case, we get an approximate equivalent gate count
of 4,400 for the AMD array and 5,300 for the MD4.


The 56-bit floating point fraction multiplier 
and divider circuitry in the IBM S360/91 [1] com- 
puter has equivalent logic delays of approximate- 
ly 36t and 110t, respectively. The comparable 
figures for MD4 (including the Fast Adder) are 
38t and 185t. It should be noted that this com- 
mercial unit performs division by the iterative 
multiplication technique which is completely 
different from the MD4 technique, but makes very 
effective use of the multiplication structure. 
Detailed cost comparisons of this unit with MD4
will not be attempted.


Figure 1. MD1 - Multiplier/Divider iterative array using Booth and
non-restoring algorithms, for n=4. (a) The array.

Figure 2. MD2 - Multiplier/Divider array using Booth and non-restoring
algorithms and carry-save along with carry lookahead, for n=4.
(a) The array; (b) Cell 3; (c) Cell 4; (d) Cell 5.

Figure 3. MD3 - Multiplier/Divider array using multiplier bit pairing
and 2-bit quotient generation, for n=5. (b) Cell 6; (c) E function.

Figure 4. MD4 - Multiplier/Divider array using multiplier bit pairing
and 2-bit quotient generation with carry-save and carry lookahead
technique, for n=6. (b) Cell 8; (d) Cell CL.

Table I: Cost and Delay Comparisons Among the MD Arrays


          Div. Delay (t)
  n       MD2       MD3

  8        48        51
 16       112       163
 32       224       579
 64       448     2,179


Table II: Cost and Delay Comparisons Among the RE, CH, and MD4 Arrays


          Div. Delay (t)
  n       MD4

  8        29
 16        70
 32       134
 64       262

A VERSATILE DATA MANIPULATOR


Tse-yun Feng 
Department of Electrical and Computer Engineering 
Syracuse University 


Syracuse, N. Y. 


Summary 


The main deviation of a parallel processor 
organization from a conventional (sequential) one 
can be seen to be in the data manipulating 
functions which are defined to be the functions 
required for preparing appropriate operands for 
fetching, execution, and storing [1]. Thus, data 
manipulating functions involve unary operations 
and they can be classified in the following 
categories: permuting, replicating, spacing, 
masking, and complementing. 


The structure of a versatile data manipulator [2]
is shown in Fig. 1.


The basic circuit has an N-by-N array construction
(or N^2 cells). Each cell consists of four gates.
The circuit can easily be partitioned. Thus,
implementation of this circuit requires only one
circuit type. At the present state of the art, up
to 8x8 cells and their decoders may be implemented
on one chip.


This data manipulator is capable of achieving 
all the data manipulating functions mentioned 
above. Furthermore, it can achieve not only 
these functions for 2's-power data sets (or strings) 
and replications, but non-2's-power functions as 
well. Such a data manipulator is particularly 
attractive in applications requiring extensive 
spacing functions. Thus operations such as
counting, multiple additions, and the bubbling
process [3] can all be easily achieved. It is also
evident that the system availability or
self-repairability can be improved or provided by
applications of the spacing functions.


It is noted that the complementing and 
comparison circuits which may be located at 
either the input or the output side of the 
structure are omitted from Fig. 1 for clarity.


AN ARRAY OF COMPUTING MEMORY CELLS


E. Della Torre and Jorge Roitman 
Department of Electrical Engineering, 
McMaster University, 
Hamilton, Ontario, Canada 


Summary 


A memory cell has been designed and construc- 
ted as an element of a highly parallel general 
purpose computer of a modified SOLOMON structure 
[1]-[2]. This cell has been interfaced to a PDP-
11/20 computer and can perform array operations
under central processor asynchronous control. The 
cell size has been minimized leaving, however, the 
capability of computing certain transcendental 
functions and performing iterative calculations in 
either the integer or the floating point modes. 


The basic SOLOMON communication structure has
been extended to include a ROW/COLUMN vector of
cells. Each cell of the vector can communicate
with all the cells of the corresponding row and
column. With such a structure, array operations,
such as matrix transposition and the solution of
Laplace's equation by Jacobi-type methods or
SOR methods, can be achieved efficiently. The
cells normally operate in unison under the central
processor control. Each cell has, however, the
capability of performing individual operations
under certain conditions. The array can be
microprogrammed to perform iterative computations
independent of the central processor until a
convergence condition has been achieved.


Each cell is a triple-address machine consisting
of 15 words and arithmetic hardware. It operates
between words or bytes of its own memory, or
between one of its words or bytes and a word or
a byte of another cell it can communicate with.
The number of words was chosen so that various
algorithms for computing transcendental functions
can be implemented within a cell. This organization
permits the simultaneous computation of certain
functions for sets of argument values.


The addressing system has been designed to
permit selecting a particular cell, a row, a
column, the even rows, the odd rows, or all the
cells of the array. In addition, three modes of
cell addressing are available: direct,
concatenated, and automatic, allowing very
efficient cell selection. The system operates by
inhibiting all but the addressed cells.

Associative-memory capabilities [3] can be
easily incorporated into the system. The existing
inhibit hardware can be used for detecting the
cells in which certain specified conditions are
satisfied. The addresses of those cells can be
read out sequentially by incorporating a cell
priority detection address system.


The cell has been satisfactorily tested by 
using several algorithms. A multi-cell system has 
been simulated on a PDP-11/20 computer. A compiler 
has been written for the PDP-11/20 to translate 
the user mnemonic language into the appropriate 
contents of the 32-bit instruction register. Sev- 
eral standard subroutines have been provided for 
integer division, floating point arithmetic, and 
computation of transcendental functions.


AN EFFICIENT ASSOCIATIVE PROCESSOR
USING BULK STORAGE 


Hubert H. Love, Jr. 
Equipment Engineering Divisions 
Hughes Aircraft Company 
Los Angeles, CA 90005 


Abstract -- A hybrid associative process- 
ing system using an MOS shift-register bulk 
memory is described, together with its applica- 
tion to large-scale fact-retrieval applications. 
The system fulfills several criteria for balanced, 
efficient design of highly-parallel machines. 

A comparison with similar machines using 
rotating memories is made. 


Introduction 


The processor organization to be described
here is an outgrowth of the Association-Storing
Processor (ASP) project.(a) The object of the
project effort was the development of processor
organizations biased toward nonarithmetic
applications. As a first step in this direction,
a representative application, namely fact
retrieval (i.e., question-answering), was chosen.
The next step in the project was the development
of a language, the ASP language, which expresses
the data organizations and the processes of
concern in the application. Following this, three
processor organizations were designed, using the
language as a guide. The organization described
here is related to the third of these [3], and
is an attempt to achieve efficient operation
of an associative memory when the data base
resides in a large, inexpensive bulk memory.


Speed, Cost and Balance 


The justification for the associative/bulk 
memory combination lies in the desire to 
simultaneously achieve higher processing 
speed and throughput, lower cost and a bal- 
anced, efficient system. Processing speed 
has been a particular problem in such sophis- 
ticated fact-retrieval applications as military 
strategic command and control and the trans- 
lation or interpretation of natural languages. 
This is because such applications involve very 
large data bases (at least the order of 10 
bits), and because very often (such as when 
deductive inference is used in the retrieval 
process) many retrieval operations must be 
performed and many records processed in 
order to answer a single query. The ability 
of associative memories to search and process 
data in a highly-parallel fashion makes these 
devices natural candidates for consideration.


The large size of the data bases used in
the applications of interest is the principal
cost consideration in the processor design,
and is the justification for the use of an
inexpensive bulk memory as the primary data
storage medium. It is particularly important
in this respect that the ratio of associative
memory to bulk memory size be small, and
that it not increase as the size of the data
base increases.


(a) See references [1] through [3].


System balance and efficiency are closely 
related terms. A balanced system, as defined 
here, is a system in which no major part of 
the system normally waits for another part 
to complete its task. Balance is particularly 
important with respect to the associative 
memory in the system to be described and, 
to a lesser extent, with respect to the bulk 
memory. A system is said to be efficient 
if all principal subsystems are performing 
a non-trivial task all or nearly all of the time 
during normal operation. Both balance and 
efficiency directly affect cost and performance 
in any computer organization, and they are 
the keys to the design of the one to be de- 
scribed here. 


The system concept is developed around a 
hybrid associative-memory/mass-memory 
hardware organization, a data structure and a 
processing strategy. These three ideas shall
be described in that order.


System Organization 


The general organization of the system is 
shown in Figure 1. The principal components 
are a set of associative memories and a bulk 
memory consisting of static MOS shift registers. 


The associative memories are conventional 
in organization and bit-serial in operation. 
Each word contains a 64-bit static shift register 
for the storage of data. Each associative mem- 
ory is capable of the following operations. 

1. A simultaneous comparison of the 
contents of every word in the memory with 
the contents of an external register, called 
the compare register. A flip-flop, called the 
match flip-flop, is set at each word satisfying 
the comparison. The operation is field-selective, 
with the fields being defined by the contents of 
another external register. 

2. An ordered serial retrieval or loading 
of those words having their match flip-flops set. 

3. A field-selective mass-write operation, 
in which the contents of an external register 
are written into the selected fields of every 
word having its match flip-flop set. 

4. The transfer of the states of the match
flip-flops to the inputs of the data storage
registers for the corresponding words, and
vice-versa.

5. Several logical operations involving
the match flip-flops and two auxiliary sets of
flip-flops, called the T1 and T2 flip-flops,
whose functions will be described.

6. A number of auxiliary operations, 
such as the setting and resetting of all match 
flip-flops. 
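
As an illustration of operations 1, 2, 3 and 6, the following Python sketch models an associative memory as a list of integer words with one match flip-flop per word. It is a word-parallel model only: the bit-serial timing, the T1 and T2 flip-flops, and operation 4 are not modelled, and the class and method names are hypothetical.

```python
class AssociativeMemory:
    """Hypothetical word-parallel model of an associative memory."""

    def __init__(self, words):
        self.words = list(words)
        self.match = [False] * len(self.words)   # one match flip-flop per word

    def reset_matches(self):
        """Auxiliary operation: reset all match flip-flops."""
        self.match = [False] * len(self.words)

    def compare(self, compare_register, field_mask):
        """Set the match flip-flop of every word that equals the compare
        register in the bit positions selected by field_mask."""
        for i, word in enumerate(self.words):
            if ((word ^ compare_register) & field_mask) == 0:
                self.match[i] = True

    def read_matches(self):
        """Ordered serial retrieval of the words whose match flip-flops are set."""
        return [w for w, m in zip(self.words, self.match) if m]

    def mass_write(self, value, field_mask):
        """Write the selected fields of value into every matched word."""
        for i, matched in enumerate(self.match):
            if matched:
                self.words[i] = (self.words[i] & ~field_mask) | (value & field_mask)
```

A compare only sets flip-flops and never clears them, so the results of several successive comparisons accumulate until reset_matches is called.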


These operations are common to most
"classical" associative memory designs.
The number of words in each associative
memory (ten memories are shown in the
figure) is a function of the size of a subset
of the average record in the data base, as
will be discussed. The shift rate for the data
registers of the associative memories during
parallel operation is a nominal 5 MHz.


The bulk memory for the system consists
of a set of individually-addressable MOS static
shift registers.(b) These should be very large,
at least 16,000 bits each, in order that the
size of the address encoding and decoding
matrices, and thus the cost of the memory,
be as low as possible.


There are as many data transfer channels
to the bulk memory (each channel bit-serial)
as the number of associative memories.
Those registers and only those registers assigned
to a channel will shift their contents when data
transfer commands are executed. This makes
it possible to shift registers that are not
involved in data transfers, by assigning them
to data transfer channels and not enabling the
outputs at the other end of the channels.


(b) The newly-emerging charge transfer
technology may make such devices equally or more
suitable as a bulk memory for this system, it
being a requirement that it be possible to
suspend the shifting operation for brief periods
(100 msec.) without loss of information. The
magnetic bubble memory is another potential
candidate.


The shift rate of the registers in the bulk 
memory is the same as that for the associative 
memories, that is, of the order of 5 MHz. 


The interface between the bulk memory 
and the set of associative memories is a 
switching network. This network permits 
each associative memory to be assigned to 
a data transfer channel from the bulk memory, 
and also permits the associative memories 
to be connected together in parallel in various 
combinations. This latter is accomplished by 
connecting the external registers of the 
associative memories in parallel and connecting 
the propagating channels (used for control of 
serial input and output operations on words in 
the memories) in series. This capability makes
it possible for several of the associative mem- 
ories to operate as a single large associative 
memory when the amount of data requires it, 
or to operate individually in simultaneous 
independent operation. 


The remainder of the system organization 
consists of 

1. an instruction processor, which controls
the execution of the special processing
algorithms used in the retrieval and
modification operations. These algorithms are
stored in a read-only instruction algorithm memory.

2. a control processor of more conventional
organization, together with a random-access
memory. This processor performs part of the
control of the bulk memory operation, and also
controls input and output operations.


Data Organization 


The data bases are constructed from ordered
triples, called relations.(c) The three items
in each relation are called, respectively, the
subject, attribute and value of the relation.
The relations are organized into records.
Each record is constructed from all of the
relations involving a particular item, called
the head item for the record. Each data entry 
in a record consists of the other two items in 
a relation. The data entries are unordered. 
There is a record in the data base for every 
item in the data, the item being the head item 
for that record. 
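
The organization of relations into records can be sketched in a few lines of Python. This is an illustration only: item names are strings rather than 24-bit item numbers, the function name build_records is hypothetical, and the order of the two remaining items inside each data entry (kept here in subject, attribute, value order) is an assumption.

```python
from collections import defaultdict


def build_records(relations):
    """Group (subject, attribute, value) relations into one record per item.

    Each record maps a head item to its data entries; every entry holds the
    other two items of one relation in which the head item appears.
    """
    records = defaultdict(list)
    for subject, attribute, value in relations:
        records[subject].append((attribute, value))
        records[attribute].append((subject, value))
        records[value].append((subject, attribute))
    return dict(records)
```

With the two relations (A, R1, B1) and (A, R2, B1), the record for A holds the entries (R1, B1) and (R2, B1), and there is also a record for each of R1, R2, and B1, so every relation is stored three times.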


In the records, each item is represented 
by a 24-bit number, called the item number. 
The user represents the item by a unique 
corresponding symbol string, called the item 
name. 


Since the size of the data records and the 
physical records (i.e., the shift registers in 
the bulk memory) are different, the data 
records must be segmented. Each segment is 
stored bit-serial on a physical record. 


(c) This is the data structure of the ASP
language [2].


All entries in every segment are 64 bits
in length. The first entry in each segment is 
a header word containing the item number for 
the head item for the record. This is used in 
locating the segment and in identifying the 
corresponding record. One of the segments, 
called the head segment, contains the bulk 
memory addresses of all of the other seg- 
ments in the entries immediately following 
the header. The other segments each contain
only the address of the head segment in the 
entry immediately following the header. The 
remaining entries in all of the segments are 
the data entries, each consisting of a pair of 
24-bit item numbers stored contiguously and 
left-justified in the entry. 
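
The layout of a data entry can be made concrete with a short Python sketch that packs two 24-bit item numbers contiguously and left-justified into a 64-bit entry. The function names are hypothetical, and the 16 low-order bits left over are assumed to be zero, since their use is not specified here.

```python
ITEM_BITS = 24
ENTRY_BITS = 64
ITEM_MASK = (1 << ITEM_BITS) - 1


def pack_entry(first_item, second_item):
    """Pack two 24-bit item numbers into the 48 high-order bits of a 64-bit entry."""
    for item in (first_item, second_item):
        if not 0 <= item <= ITEM_MASK:
            raise ValueError("item numbers must fit in 24 bits")
    return (first_item << (ENTRY_BITS - ITEM_BITS)) | \
           (second_item << (ENTRY_BITS - 2 * ITEM_BITS))


def unpack_entry(entry):
    """Recover the two item numbers from a 64-bit data entry."""
    first_item = (entry >> (ENTRY_BITS - ITEM_BITS)) & ITEM_MASK
    second_item = (entry >> (ENTRY_BITS - 2 * ITEM_BITS)) & ITEM_MASK
    return first_item, second_item
```

With this layout, a 16,000-bit shift register holds 250 such entries, including the header word and any segment addresses.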


Operation of the System 


The principal function of the system is the 
selective retrieval from and modification of 
the data base. The criteria for the retrieval 
and modification are each specified by a set 
of relations, called the retrieval structure 
and the replacement structure, respectively.(d)
In these structures, the known items are repre- 
sented by their item numbers. Unknown items, 
to be determined by the retrieval operation, 
are each represented by one of a set of special 
numbers reserved by the software for this 
purpose. 


Both the retrieval and replacement 
structures may contain unknown items. In 
the replacement structures, each relation 
specifies a set of relations to be inserted into 
the data base. Relations appearing in the
retrieval structure but not in the replacement 
structure each specify a set of relations to be 
deleted from the data base. 


The central process from which the retrieval
operation is constructed is that of
context addressing an unknown item. This
is the process of identifying all items in the
data that satisfy the "context" of relations
in which an unknown appears in the retrieval
structure. An item is said to satisfy this
context if, for every relation in the retrieval
structure containing the given unknown, there
corresponds a relation in the data containing
the given item in place of the unknown, and
which is otherwise identical. If there is
another unknown in the relation in the
retrieval structure, a relation in the data is
considered "identical" for any item
corresponding to that unknown, if it is identical
otherwise.
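
The definition of context addressing can be written down directly as a brute-force check over the data. The Python sketch below does this purely as a specification; it says nothing about how the hardware performs the search. Item names are strings, unknowns are passed in as an explicit set rather than as reserved item numbers, and the function names are hypothetical.

```python
def relation_matches(pattern, datum, unknown, item, unknowns):
    """True if datum equals pattern with item substituted for unknown.

    Any other unknown in the pattern matches any item in that position.
    """
    for want, have in zip(pattern, datum):
        if want == unknown:
            if have != item:
                return False
        elif want in unknowns:
            continue
        elif want != have:
            return False
    return True


def context_address(unknown, retrieval, data, unknowns):
    """Return all items satisfying the context of unknown in the retrieval structure."""
    candidates = {x for relation in data for x in relation}
    context = [r for r in retrieval if unknown in r]
    return {
        item
        for item in candidates
        if all(
            any(relation_matches(r, d, unknown, item, unknowns) for d in data)
            for r in context
        )
    }
```

For a data base containing (A, R1, B1), (A, R1, B3), (A, R1, B6), (A, R2, B1), (B, R2, B3) and (B, R2, B6), context addressing X in the retrieval structure (A, R1, X), (B, R2, X) yields B3 and B6.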


Only the retrieval operations will be 
described here. The data modification 
operations are described in [3] for a similar 
system. 


All retrieval operations are performed
by first context-addressing all of the individual
unknowns in the retrieval structure, and then
resolving any relations involving more than
one unknown. For brevity in describing
the processes, all unknowns in the following
discussion will be either items or values,
but not attributes. The processes can easily
be extended to cases in which an unknown
is an attribute.


(d) This is also the structure of the ASP
language [2].


To perform a context-addressing operation, 
two different processes are used. They are, 
respectively, the Load Subrecord operation
and the Compare Record operation. To de- 
scribe these operations, a retrieval structure 
consisting of several relations involving a 
single unknown shall be assumed. The un- 
known item shall be assumed to be either the 
subject or the value in each one of these 
relations. 


The Load Subrecord Operation 


To begin the context-addressing operation, 
one of the relations in the retrieval structure 
is selected (at random, if desired), and the 
head segment of the record for the subject 
or value (one of these will be a known item) 
is accessed in bulk memory. The Load Sub- 
record operation is then executed to load a 
subset of the record into one or more of the 
associative memories. This subrecord con- 
sists of all record entries which contain the 
same attribute as the relation from the re- 
trieval structure. The operation is the 
following. 

1. As the shift register containing the 
home segment is shifted, one entry at a time, 
those entries for which the attribute is the same 
as the attribute of the corresponding relation 
from the retrieval structure are selected and 
loaded into one of the associative memories. 
The selection is made by comparing each 
entry with the contents of a register called 
the selection register. With each entry so 
loaded, the T1 flip-flop in the word is set. 
This tag bit, at the completion of the context 
address, will be set at all entries con- 
taining values of the unknown item. At the 
same time, the addresses of the other seg- 
ments of the record are retrieved (from the 
home segment) and the shift registers con- 
taining them are shifted to make the segments 
available for processing. 

2. As the processing of a segment is 
completed, and as another segment becomes 
available at the output of a shift register, 
the process is repeated for that seg- 
ment. If the associative memory becomes 
filled, another associative memory is selected 
by the system and is loaded in turn. 


When the processing of all segments 
of the record is completed, the associative 
memory or memories will contain the entries 
for all values for the unknown item that are 
specified by the retrieval structure relation. 
If the retrieval structure contained only 
that one relation, the entire context-addressing 
operation would now be completed. 
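
The Load Subrecord operation can be sketched in software as a filter over the entries of a record. In the Python sketch below, a record is a list of (attribute, value) entries for a known head item, the associative memory is a list of dictionaries with fields I1, I2, T1 and MFF, and segmentation of the record over several shift registers is ignored. The function name, the field layout, and the item names in the usage example are illustrative assumptions.

```python
def load_subrecord(record_entries, attribute):
    """Load into an associative memory the entries whose attribute matches.

    record_entries: (attribute, value) pairs, in the order they are shifted out.
    Returns the list of loaded associative-memory words.
    """
    memory = []
    for entry_attribute, entry_value in record_entries:
        if entry_attribute != attribute:      # selection register comparison
            continue
        for word in memory:
            word["MFF"] = False               # only the newest word is marked
        memory.append({
            "I1": entry_attribute,
            "I2": entry_value,
            "T1": True,                       # tag bit set for each loaded entry
            "MFF": True,                      # identifies the word being loaded
        })
    return memory
```

Loading the record with entries (R1, B1), (R2, B1), (R1, B3), (R9, B9), (R1, B6) for the attribute R1 leaves three words in the memory, whose I2 fields hold B1, B3 and B6.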
The Load Subrecord operation is illustrated
in Figure 2. The example is shown for
the retrieval structure relation (A, R1, X),
in which the unknown item is represented
by the X. That relation is also shown in the
upper left-hand part of the figure in directed-
graph form (which is the ASP language
representation). The record being processed is
the one having the item A as the head item.


Five contiguous entries from the record
are shown in the figure. These are the entries
(R1, B1), (R2, B1), (R1, B3), (R9, BY) and
(R1, B6). In the illustration, the first and third
of these entries have already been selected and
loaded into the associative memory. The input
(i.e., compare) register is shown containing
the most recently loaded entry.


The selection criterion (which is that
the attribute of the entry be the item R1)
is shown as the contents of the corresponding
field of the selection register, with the symbol
"D/C", representing "don't care", shown in
the other fields. In the associative memory,
the column labeled T1 represents the T1
flip-flops, which are set for each loaded entry.
The column labeled MFF (match flip-flop)
shows the flip-flop set for the most recently
loaded entry. This represents the use of the
match flip-flop to identify a single word in the
associative memory for which some operation
(in this case, the load operation) is to be
performed.


The I1 and I2 fields in the associative
memory words are the 24-bit fields for the
item numbers for the other two items (that is,
other than A) in each data entry. At the
conclusion of the operation, the I2 fields will
contain the values of the unknown item which
satisfy the relation (A, R1, X).


The operation establishes the criterion
for the size of the associative memories,
which should be that of the average subrecord
in the data base, rather than the size of the
entire record.


It can be seen that the Load Subrecord
operation is essentially balanced, in that
neither the associative memory nor the bulk
memory must wait for the other to complete
an operation. The only exception is the delay
in accessing the home segment, and a possible
delay in accessing another segment. The
operation, however, is not efficient, since the
associative memory is not performing a parallel
operation, but is only being loaded. As will
be seen, the remaining operations in the
context-addressing process are both balanced and
efficient.


The Compare Record Operation 


During the Load Subrecord operation, 
the home segment of the record corresponding 
to one of the other relations in the retrieval 
structure (that is, having the known subject 
or value from the relation as its head item) 
is being accessed. When the Load Subrecord 
operation is completed, this second record 
is processed, entry-by-entry, against the 
contents of the associative memory. This 
operation is called the Compare Record 
operation and, for each entry, is as 
follows. 

1. The attribute of the entry is compared 
with the attribute from the corresponding re- 
trieval structure relation. As in the Load 
Subrecord operation, the selection register 
is used in this process. 

2. At the same time, the entry is compared 
in parallel with all entries in the 
associative memory (i.e., the subrecord 
already loaded), comparing only the value 
fields and the T1 flip-flops. If both comparisons 
1 and 2 are successful, the match 
flip-flop is set at each matching entry in the 
associative memory. (All match flip-flops 
are reset before the first entry is processed.) 


After the last entry in the record has 
been so processed, the T1 flip-flops and the 
(corresponding) match flip-flops are logically 
ANDed together, and the results stored in the 
T1 flip-flops. Those entries that now have 
their T1 flip-flops set are the entries which 
contain, in their I2 fields, all values of the 
unknown item that satisfy both of the retrieval 
structure relations thus far processed. 
If there are no other relations in the retrieval 
structure, the context addressing operation is 
now completed. 
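
The Load Subrecord and Compare Record steps described above can be sketched in software as follows. This is only an illustrative model, not the hardware design: the names Word, load_subrecord and compare_record are hypothetical, each record is assumed to be a list of (attribute, value) pairs, and the inner loop stands in for the comparison that the associative memory performs on all words at once.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One associative-memory word (hypothetical software model)."""
    i1: str            # I1 field: attribute item
    i2: str            # I2 field: candidate value of the unknown X
    t1: bool = True    # T1 flip-flop, set when the entry is loaded
    mff: bool = False  # match flip-flop


def load_subrecord(record, attribute):
    """Load the entries whose attribute matches the selection register."""
    return [Word(i1, i2) for (i1, i2) in record if i1 == attribute]


def compare_record(memory, record, attribute):
    """Process a second record, entry by entry, against the loaded subrecord."""
    for word in memory:
        word.mff = False                  # reset all match flip-flops first
    for (i1, i2) in record:
        if i1 != attribute:               # comparison 1: selection register
            continue
        for word in memory:               # comparison 2: parallel in hardware
            if word.t1 and word.i2 == i2:
                word.mff = True
    for word in memory:
        word.t1 = word.t1 and word.mff    # T1 <- T1 AND match flip-flop
    return [word.i2 for word in memory if word.t1]


# Retrieval structure (A, R1, X), (B, R2, X) with made-up data.
record_a = [("R1", "p"), ("R9", "q"), ("R1", "r"), ("R1", "s")]
record_b = [("R2", "r"), ("R2", "z"), ("R7", "p"), ("R2", "p")]
memory = load_subrecord(record_a, "R1")
print(compare_record(memory, record_b, "R2"))  # ['p', 'r']
```

Further relations on the same unknown would simply call compare_record again on the same memory, since T1 already carries the result of the relations processed so far.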


The Compare Record operation is illus- 
trated in Figure 3, which shows a retrieval 
structure of two relations. The first of these, 
(A, R1, X), is shown as already having been 
processed, using the Load Subrecord operation. 
The second relation in the retrieval 
structure is the relation (B, R2, X), and the 
record shown being processed, using the 
Compare Record operation, is the record for 
the item B. The selection register is shown 
containing the attribute from the relation in 
its I2 field. 


Five contiguous entries from the record 
for B are shown, with the first three of these 
having already been processed. It is seen 
that the first and third of these entries have 
satisfied both compare operations, and the 
match flip-flops are set at the second and 
fourth words in the associative memory as 
a result. 


If there are more than two relations in 
a retrieval structure involving a given unknown, 
the third, fourth, etc. of the relations are 
processed exactly like the second, using 
the Compare Record operation. 


It is seen that the Compare Record 
operation is both balanced and efficient. 
It is balanced in the same way as the Load 
Subrecord operation, and it is also efficient 
since, unlike the former operation, the 
associative memory is performing parallel 
compare operations for every entry in the 
record. Moreover, the access delays ex- 
perienced in connection with the processing 
of the first record will seldom if ever be 
encountered for the remaining records. 
This is because all of the records involved 
in the context of the unknown are known at 
the start of the context-addressing operation, 
and thus can be searched for simultaneously. 
By the time the first record is processed, 
at least one of the other records will be 
accessible for processing in turn. 


Figure 3. Compare Record operation (record for B, compare register, selection register) 


If a replacement structure contains several 
unknowns, the system design permits several 
of them to be context-addressed simultaneously 
in the fashion just described. This is a result 
of having more than one associative memory 
and more than one data transfer channel 
between the bulk memory and the associative 
memories. The simultaneity is limited only 
by the number of associative memories and 
by the fact that more than one associative 
memory may be required for the selected 
entries from an unusually large record (se- 
lected by the Load Subrecord operation). 


If there are no relations in the retrieval 
structure involving more than one unknown, the 
process of identifying all values of all unknowns 
in the retrieval structure can be accomplished 
by the processes already described. If there 
are such relations, a number of other operations 
have been defined. All of these operations 
require the use of more than one associative 
memory, and also special logic for operating 
on the success/fail results of the various 
comparison operations. All of these operations 
are performed, when applicable, after all 
individual unknowns have been separately (and 
simultaneously) context addressed. Three 
of the operations shall be described here. 

Each of them applies to a particular con- 
figuration of interrelated unknown items. 
For other configurations, the corresponding 
operations can be derived by reference to 
these. 


The Find Pairs Operation 


The first of the operations for interrelated 
unknown items, called the Find Pairs operation, 
identifies all corresponding pairs of values for 
two unknown items that are in a single relation 
in the retrieval structure. Each of these 
pairs corresponds to some entry in that record 
that has the attribute from the retrieval 
structure relation as its head item. These 
matching entries each represent relations 
in the data which have the same attribute 
as the retrieval structure relation, and 
whose subject and value are each candidates 
for the respective unknown items in the 
retrieval structure relation. The candidates 
are those items that have been identified by 
the earlier context addressing of the two 
unknowns. 


Once the entries for the corresponding pairs 
of values have been identified, one or more of 
the following operations is performed. 


1. The entries themselves are tagged 
directly in the record in the bulk memory by 
setting bits in the tag fields of the entries 
(bits 48-63). 

2. The entries are written in an unused 
associative memory for later processing. Ex- 
amples of such processing will be shown. 

3. The entries are written into an unused 
(blank) physical record in the data base for later 
processing. 

4. The entries are retrieved for output to 
the user (assuming that the entire retrieval 
operation has been completed with the comple- 
tion of the current operation). 


Figure 4 shows the hardware configuration 
of registers, associative memories and data 
records used in the Find Pairs operation, and 
illustrates the use of the operation in connection 
with an example retrieval structure. The figure 
shows three of the associative memories from 
the system, labeled AM#1, AM#2, and AM#3. 
AM#1 and AM#2 each contain entries from the 
record that was processed first in context 
addressing one of the two unknown items. AM#1 
contains the entries involving the subject of the 
relation involving the two unknowns. AM#2 
contains the entries involving the value of the 
relation. The configuration in the figure also 
includes part of a record, drawn as though it 
were a tape, that is being processed against 
the contents of AM#1 and AM#2. The third 
associative memory, AM#3, is a spare memory 
that has been assigned for holding the match- 
ing entries from the record (i.e., the corre- 
sponding pairs of values for the two unknowns) 
if that option is specified in the operation. 


To begin the Find Pairs operation, AM#3 
is cleared, and the match flip-flops in all of 
the associative memories are reset. Following 
this, the first match flip-flop in AM#3 is set. 
"Don't care" (D/C) conditions are put into the 
compare registers of AM#1 and AM#2, as 
shown, and the T1 fields of all three compare 
registers are set. 


Now each entry from the record for the 
attribute of the retrieval structure relation is 
processed as follows. 


1. The contents of the I1 field in the entry 
are transferred to the I2 field of the compare 
register of AM#1. The contents of the I2 field 
of the entry are transferred to the I1 field of the 
compare register of AM#2. Following this, the 
record is shifted to the next entry. 

2. A compare operation is performed on 
both associative memories at the 
same time, and the two success/fail conditions 
are ANDed together. (The success condition is 
indicated by the setting of at least one match flip- 
flop in an associative memory.) 

3. If both compare operations are success- 
ful, the current entry (which is now known to 
contain a corresponding pair of values for the 
two unknowns) can be copied into the current 
entry position of AM#3 (the entry being defined 
by the setting of the match flip-flop). Or, if 
desired, the entry can be copied into another 
record, reserved for the purpose, in bulk 
memory. As a third alternative, the entry can 
be tagged directly in the record in its T1 field 
(or in one of bits 48-63). 


After all of the entries in the record have 
been processed as just described, those entries 
that contain corresponding pairs of values for 
the two unknowns will have been identified and 
tagged and/or copied. 
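
The logical effect of the Find Pairs operation can be sketched as follows. This is a software illustration under stated assumptions, not the hardware algorithm: find_pairs is a hypothetical name, the record for the attribute is assumed to be a list of (subject, value) pairs, and Python sets stand in for the candidate entries held in AM#1 and AM#2.

```python
def find_pairs(record, candidates_x1, candidates_x2):
    """Return the entries whose subject and value are both candidates.

    record        -- (subject, value) entries of the attribute's record
    candidates_x1 -- candidate values held in AM#1
    candidates_x2 -- candidate values held in AM#2
    """
    am3 = []                                # spare memory for matching entries
    for subject, value in record:           # one entry per shift of the record
        hit1 = subject in candidates_x1     # compare operation in AM#1
        hit2 = value in candidates_x2       # compare operation in AM#2
        if hit1 and hit2:                   # AND of the two success/fail results
            am3.append((subject, value))    # copy option; tagging is analogous
    return am3


# Made-up record for the attribute R5 of the relation (X1, R5, X2).
record_r5 = [("a", "u"), ("b", "v"), ("c", "u"), ("d", "w")]
print(find_pairs(record_r5, {"a", "c", "e"}, {"u", "w"}))
# [('a', 'u'), ('c', 'u')]
```

In the hardware both membership tests take one compare cycle regardless of how many candidates are loaded, which is what keeps the record shifting at full rate.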


The example in Figure 4 illustrates 
the Find Pairs operation for the retrieval 
structure consisting of the five relations (A, R1, 
X1), (B, R2, X1), (X1, R5, X2), (C, R3, X2) and 
(D, R4, X2). The two unknown items are represented 
by the symbols X1 and X2. AM#1 contains 
the candidates for X1, as determined by 
the context addressing of X1. They are in the 
I2 fields of those entries that have tag T1 set. 
Similarly, AM#2 contains the candidates for X2. 


Six entries from the record for R5 are 
shown being processed against the contents of 
AM#1 and AM#2. R5 is the attribute of the 
relation that interrelates the two unknowns. 
Each entry in that record contains pairs of 
potential values of X1 and X2. The first three 
entries have already been processed, and it is 
seen that the first and third of these have 
matched. They are both tagged in the T1 fields 
of the entries themselves and have also been 
copied into AM#3. In particular, the third 
entry has just been tested, and the match flip- 
flops in AM#1 and AM#2 are set at the match- 
ing entries. 


The Process Threes and Process 
Fours Operations 


There are a number of possible retrieval 
structure configurations which involve three 
or more interrelated unknown items. For 
each such configuration there is a corre- 
sponding instruction with its hardware config- 
uration and processing algorithm. The 
hardware configurations for two of these 
instructions, the Process Threes and Process 
Fours instructions, are shown in Figures 5 
and 6, respectively, together with example 
retrieval structures. The two figures are 
given for illustration only. The operations 
themselves are described in detail in reference 
[3] for a similar system. Only a brief 
discussion is given here. 


The Process Threes operation determines 
pairs of corresponding values of two unknown 
items that are indirectly related in a retrieval 
structure through a third unknown item. (In 
the example in Figure 5, the third unknown 
item is X2.) The Process Fours operation 
determines such pairs of values for cases in 
which the two unknowns are related through 
two intervening unknown items (X2 and X3 in 
the example in Figure 6). 


Both operations are performed after all 
corresponding pairs of values for the directly 
related unknown items in the retrieval structure 
have been determined using the Find Pairs 
operation. These corresponding pairs have 
been variously stored on blank records or in 
associative memories as required for the 
current operation. 


For the Process Threes operation, only 
one associative memory is required. (The 
second one shown in Figure 5 is for optional 
storage of the corresponding pairs determined 
by the operation.) The Process Fours oper- 
ation requires two associative memories. 
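
In purely logical terms, Process Threes composes two sets of pairs through the shared intermediate unknown. The sketch below shows only that logical effect. The hardware algorithm itself is given in reference [3] and is not reproduced here; the name process_threes and the list-of-pairs representation are assumptions made for illustration.

```python
def process_threes(pairs_12, pairs_23):
    """Relate X1 to X3 through the intermediate unknown X2.

    pairs_12 -- (x1, x2) pairs found earlier by Find Pairs
    pairs_23 -- (x2, x3) pairs found earlier by Find Pairs
    """
    by_x2 = {}
    for x2, x3 in pairs_23:                 # index the second set by X2
        by_x2.setdefault(x2, []).append(x3)
    result = []
    for x1, x2 in pairs_12:
        for x3 in by_x2.get(x2, []):
            if (x1, x3) not in result:      # keep each pair once
                result.append((x1, x3))
    return result


print(process_threes([("a", "m"), ("b", "n")], [("m", "y"), ("m", "z")]))
# [('a', 'y'), ('a', 'z')]
```

Process Fours would, under the same reading, apply this composition twice, once for each intervening unknown.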


Cost and Performance Considerations 


The associative/shift-register system is 
essentially a balanced system with respect to 
its two primary subsystems, the associative 
memories and the bulk memory. The shift 
rates for both memories are the same, and 
both memories are kept operating at or near 
that shift rate during normal operation. 


Access delay is small in all processing of 
records from bulk memory. This is because 
all segments of a record except the first can 
be accessed nearly simultaneously and, once 
accessed, can be kept in readiness for immedi- 
ate processing. There is a delay in accessing 
the home segment; this averages 1.6 msec. 
for the 16,000-bit registers in the bulk memory. 
Access delays for the remaining segments 
of the record, and for the segments of 
any other record being accessed at the same 
time for later processing, will be small or 
nonexistent. 


As an example, consider a record consisting 
of 640 data entries divided into ten segments 
of 64 data entries each. The total 
processing time for such a record (for the 
Load Subrecord and Compare Record operations) 
will be very close to the 1.6 msec. average 
access time for the home segment plus 0.8 msec. 
for processing each segment, a total of 9.6 msec. 
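
The arithmetic of this example can be written out as a small calculation. The two constants are the figures quoted in the text; the function name record_time_msec is hypothetical.

```python
HOME_ACCESS_MSEC = 1.6   # average access delay for the home segment
SEGMENT_MSEC = 0.8       # processing time per 64-entry segment


def record_time_msec(segments):
    """Approximate processing time for a record of the given segment count."""
    return HOME_ACCESS_MSEC + segments * SEGMENT_MSEC


print(round(record_time_msec(10), 1))  # 9.6
```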


For rotating memories, the limitation on 
processing speed is largely a function of the 
rate of rotation, since the instantaneous data 
transfer rates for modern fixed-head disks and 
drums are high. For such memories, the 
processing of most records will require an 
entire revolution (33 msec. for the typical disk 
rotating at 1800 rpm.) plus a fraction of a disk 
revolution for accessing the first segment to 
be processed. (a) 
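
For comparison, the corresponding estimate for a rotating memory can be sketched as below. It assumes, as a reading of the footnote to this paragraph, that the first segment is reached after an average of 1/(n+1) of a revolution when there are n segments; the function names are hypothetical.

```python
def revolution_msec(rpm):
    """Time for one disk revolution in milliseconds."""
    return 60_000.0 / rpm


def disk_record_time_msec(rpm, segments):
    """One full revolution plus the assumed average wait for the first segment."""
    rev = revolution_msec(rpm)
    return rev * (1.0 + 1.0 / (segments + 1))


print(round(revolution_msec(1800), 1))            # 33.3
print(round(disk_record_time_msec(1800, 10), 1))  # 36.4
```

Under these assumptions a ten-segment record takes roughly 36 msec. on the disk, against the 9.6 msec. estimated above for the shift-register memory.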


A large part of the advantage of using shift 
registers rather than rotating memories lies 
not in the increased speed but in the relative 
simplicity of system design and operation. 


(a) It is assumed that if a disk is used as the bulk 
memory, every segment of a record would 
contain the addresses of the remaining segments. 
If cueing of access were then used, 
one of the segments could be accessed in an 
average of 1/(n+1) of a disk revolution, where 
n is the number of segments. 


Figure 5. Process Threes Operation 


Shift registers, for example, do not require 
cueing of accesses in order to minimize average 
access time per record. And they do not 
require the related buffering or the processing 
effort needed to handle the buffering and cueing 
operations. Moreover, such elaborate techniques 
as deferred modification (b) are much 
less needed when shift registers are used. 


The present ratio of costs for MOS shift 
registers to fixed-head disks is of the order of 
9 to 1. At this ratio, the sacrifice in processing 
speed, processing and buffering costs and 
design effort when disks are used may still be 
justified. However, with the reduced costs of 
LSI to be expected in the near future (charge- 
transfer devices costing about 1/4th the cost 
of disks are being announced), the simpler 
shift-register memory should be considered in 
any present effort to achieve a balanced associative 
system. 


(b) This is a technique for increasing record- 
processing throughput in which modifications 
to the data base are made in a reserved 
region in fast memory as soon as they are 
determined, rather than in bulk memory. 
The main data base is then modified later 
from the contents of this buffer, overlapping 
later operations. In this way, subsequent 
operations on the data base can proceed 
immediately. This technique requires that 
all search operations must include a search 
of the buffer as well as the bulk memory, 
and is in general very complicated to 
implement. 
Abstract -- The use of two levels of 
parallelism facilitates the design of an effi- 
cient programmable signal processing computer. 
At the system level, multiple functional units 
(multiprocessors) perform distinct functional 
tasks such as data gathering, data organization, 
and signal transformation. At the implemen- 
tation level, horizontal microprogrammed control 
of parallel resources effects flexible and effi- 
cient processing. 


Introduction 


Modern signal processing systems perform 
many tasks by sampling analog signals and 
transforming the sampled digital data. In 
systems like radar and sonar there is typically 
so much information to analyze that it has been 
necessary to develop special-purpose hardwired 
devices to sample and transform the data in 
real time. The emergence of LSI circuit tech- 
nology and high speed memories provides the 
capability of developing programmable signal 
processors which would reduce proliferation of 
special purpose devices, reduce the manufacturing 
cost (by economies of scale), and simplify 
maintenance. The use of two levels of parallelism 
facilitates the design of such a programmable 
signal processor [1]. 


Parallelism at the System Level 


At the system level, an efficient signal 
processing computer assigns distinct processes 
to different functional units which operate in 
parallel. For system supervision and simple 
data organization and transformation, the 
system employs a sophisticated controller. For 
signal transformations, a specialized arithmetic 
processor is required. Additional functional 
units collect and store data and control communication 
among other units. In the AN/UYK-17 
(AADC/SPE) computer [2] (see Figure 1) separate 
functional units perform such distinct processes. 


The Microprogrammed Control Unit (MCU) is 
the system controller. Its functions include 
data management, process scheduling, I/O con- 
trol, interrupt handling, and some applications 
routine processing. The MCU massages (e.g. by 
ordering or scaling) signal information, 
placed in buffer memories by I/O devices, into 
a form amenable to transformation by signal 
processing algorithms, and leaves it in buffer 
memory for processing by a special purpose 
arithmetic unit. After the data is trans- 
formed, the MCU may perform some post pro- 
cessing functions and store information for 
later retrieval. The MCU also performs system 
functions such as handling operator requests, 
controlling displays, etc. 


The Signal Processing Arithmetic Unit 
(SPAU) is the system arithmetic processor. Its 
function is to perform high speed execution of 
processing operations on arrayed data. These 
operations include spectrum generation, convo- 
lution, correlation, and digital filtering. 
SPAU processing is scheduled by the MCU. 

After SPAU processing is initiated, the SPAU 
operates independently of the MCU. 


The AN/UYK-17 contains up to eight buffer 
storage modules (BSMs), which provide central, 
high speed (150 nanosecond cycle time) memory 
for the system. Each BSM contains 4096 32-bit 
words. The BSMs provide storage for MCU execu- 
tive and application data tables, system data 
arrays, working storage for MCU and SPAU processing 
operations, and buffer areas for I/O 
data movement. Because the MCU supervises I/O 
operations and storage of data in the BSMs, 
the SPAU need not consider the problems of I/O 
processing; its data resides in the high speed 
BSMs. 


To provide a fast and flexible means of 
moving data between BSMs and peripheral 
devices, the AN/UYK-17 contains one or more 
selector channel controllers (SCCs). The stor- 
age control unit (SCU) provides a switching 
interface between the independent BSMs and 
other system components. Because each MCU, 
SPAU, and SCC can access BSMs every clock 
cycle, the SCU switches each unit to the BSM 
it is addressing and resolves conflicts by a 
priority mechanism. 


To provide general intermodule communi- 
cation, the AN/UYK-17 contains a connector 
called the Z-bus which consists of sixteen 
bidirectional data lines and fourteen control 
lines. In addition, there is a sophisticated 
interrupt system through which system com- 
ponents and peripheral devices can notify cen- 
tral control, the MCU, of changes in their 
operation or status. 


Parallelism at the Implementation Level 


In the implementation of system components 
that perform multiple functions of a similar 
nature, the use of parallelism can significantly 
improve the performance of the components and 
hence of the system. In the AN/UYK-17 the MCU 
and the SPAU contain several resources that 
operate in parallel. User written horizontal 
microprograms control these resources. 


Parallelism in the SPAU 


The capability to effect the second order 
recursive filter and the FFT butterfly is fundamental 
in signal processing [1,3]. Figure 2 shows 
the general configuration of a second order 
recursive filter [1]. $z^{-1}$ in Figure 2 represents 
a unit delay while the circles indicate 
addition or multiplication by a constant. The 
output y at any time can be described in the 
following two step computation, which uses the 
labels defined in Figure 2: 


$W_0 = x - B_1 W_1 - B_2 W_2$ 
$y = W_0 + A_1 W_1 + A_2 W_2$ 


The data flow graph shown in Figure 3 follows 
from these equations. Squares in Figure 3 
represent data items and circles represent 
multiplication or addition. "p," "q," "r" and 
"s" are intermediate data items, while T1 and 
T2 represent delay operators. Note the possibilities 
of performing the operations in 
parallel. 
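
The two-step computation can be sketched in software as follows. The sketch assumes the standard form of the second order recursive filter, W0 = x - B1*W1 - B2*W2 followed by y = W0 + A1*W1 + A2*W2, with W1 and W2 the once- and twice-delayed values of W0; the function name and the zero initial state are assumptions made for illustration.

```python
def second_order_filter(samples, a1, a2, b1, b2):
    """Apply the two-step recursive filter computation to a sample sequence."""
    w1 = w2 = 0.0                          # delayed values, assumed zero at start
    outputs = []
    for x in samples:
        # The four products are independent of one another, which is why
        # four multipliers can form them in the same cycle.
        w0 = x - b1 * w1 - b2 * w2
        y = w0 + a1 * w1 + a2 * w2
        outputs.append(y)
        w2, w1 = w1, w0                    # the two unit delays
    return outputs


# With B1 = B2 = 0 the impulse response is simply 1, A1, A2, 0, ...
print(second_order_filter([1.0, 0.0, 0.0, 0.0], 2.0, 3.0, 0.0, 0.0))
# [1.0, 2.0, 3.0, 0.0]
```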


Figure 4 shows the arithmetic section of 
the SPAU. As in the data flow graph of 
Figure 3 there are four multipliers. These 
multipliers operate in parallel and produce a 
product every clock cycle (150 nanoseconds). 
Although there are four adders, the results of 
adders one and three may be inputs to adders 
two and four, respectively, in the same cycle; 
so two pairs of additions can be performed con- 
secutively in a single cycle. This is equiva- 
lent to two three-input adders. The X and Y 
local stores are used to store intermediate 
results and to hold data that have been read 
from or will be written to BSMs. The adders, 
buffer reads and writes, and additional register 
transfers operate in parallel with each other 
and with the multipliers. 


Figure 3. Data flow graph for second order recursive filter 


Fundamental to the computation of the 
fast Fourier transformation (FFT) is the FFT 
butterfly [4]. For data points represented as 
complex numbers, Figure 5 shows the data flow 
graph for computing the FFT butterfly. In 
this figure X(i) and X%,(j) are data inputs to 
the butterfly while X(i+1l) and X,(j+1) are 
output items. $W^k$ is a weight term, where 


$W_c = \cos(2\pi k/N)$ and $W_s = -\sin(2\pi k/N)$, with N = the 
total number of input data points to the FFT. 
By considering the data flow graph in Figure 5 
two conclusions can be reached that may assist 
computation: 


1) No memory cell in the data flow graph 
is reused by subsequent operators. 
Hence, the computation may be executed 
in a pipelined fashion. 

2) The left-right symmetry in the data 
flow graph permits an increase in 
throughput when the input data points 
are all real rather than complex. 


Referring again to Figure 4, the effect of the 
FFT butterfly computation on the design of the 
SPAU arithmetic section is apparent. 
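
A butterfly of this kind can be sketched as follows. The sketch assumes the common form in which the second input is multiplied by the weight and the product is added to and subtracted from the first input, with the weight taken as cos(2*pi*k/N) - j*sin(2*pi*k/N); the function name and argument order are hypothetical. The complex product is written as four real multiplications, matching the four multipliers of the arithmetic section.

```python
import math


def butterfly(x_i, x_j, k, n):
    """Return the two outputs of one FFT butterfly for complex inputs."""
    w_c = math.cos(2.0 * math.pi * k / n)    # real part of the weight
    w_s = -math.sin(2.0 * math.pi * k / n)   # imaginary part of the weight
    # Four real multiplications form the complex product W * x_j.
    p_re = w_c * x_j.real - w_s * x_j.imag
    p_im = w_c * x_j.imag + w_s * x_j.real
    # Two pairs of additions form the sum and the difference.
    upper = complex(x_i.real + p_re, x_i.imag + p_im)
    lower = complex(x_i.real - p_re, x_i.imag - p_im)
    return upper, lower


# With k = 0 and N = 2 the butterfly is a two-point transform.
print(butterfly(1 + 0j, 2 + 0j, 0, 2))  # ((3+0j), (-1+0j))
```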


Note that the schemata for the computation 
of the FFT butterfly and second order recursive 
filter were similar enough so that the same 
hardware could easily be used to effect both 
algorithms. Register selectors facilitate dynamic 
reconfigurability. Control of the SPAU is 
effected by horizontal microprograms; 160 bit 
microinstructions which contain 63 fields control 
the resources of the arithmetic section 
and also the addressing section (which has 
three independent address formation units for 
computing buffer and ROM addresses) and 
sequencing mechanism. 


Figure 4. SPAU arithmetic section 


Parallelism in the MCU 


Like the SPAU, the MCU (see Figure 6) is 
controlled by horizontal microinstructions which 
execute in 150 nanoseconds, the system cycle 
time. Microinstructions in the MCU contain 64 
bits which define seventeen fields. These 
fields control specific MCU facilities: 


1) buffer input and output 

2) source and destination register 
selection for the ALU/shifter 

3) ALU/shifter operation 

4) interrupt control 

5) auxiliary register transfers 

6) sequence control. 


Figure 5. Data flow graph for the FFT butterfly 


As in the SPAU, the facilities operate in 
parallel. Thus throughput is normally greater 
than for standard systems. For example, the 
MCU can transpose a 40 by 40 matrix (involving 
3200 memory references) in less than 1700 
cycles. 
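
The count of 3200 memory references is what a 40 by 40 matrix implies if each element is read once and written once, and the sketch below makes that count explicit. The matrix size is an assumption inferred from the reference count, and the function name is hypothetical; the code counts references only and says nothing about how the MCU microprogram overlaps them.

```python
def transpose_counting(matrix):
    """Transpose a matrix and count one read and one write per element."""
    rows, cols = len(matrix), len(matrix[0])
    result = [[None] * rows for _ in range(cols)]
    references = 0
    for r in range(rows):
        for c in range(cols):
            value = matrix[r][c]      # one read reference
            references += 1
            result[c][r] = value      # one write reference
            references += 1
    return result, references


matrix = [[r * 40 + c for c in range(40)] for r in range(40)]
_, refs = transpose_counting(matrix)
print(refs)                # 3200
print(1700 * 150 / 1000)   # 255.0 microseconds for 1700 cycles of 150 ns
```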


AN/UYK-17 Configurations 


Because the AN/UYK-17 system provides 
general intermodule communication facilities 
(the Z-bus and the interrupt capabilities), 
system components can be configured in a 
variety of ways. The basic simplex system 
consists of an MCU, a SPAU, an SCU, four BSMs, 
and an SCC. Additional components may be 
connected; an example (see Figure 7) follows 
the architecture of the CDC 6600 computer. One 
or more MCUs can serve as peripheral processing 
units that control I/O devices. A master MCU 
can serve as a (scoreboard) scheduler which 
manages the buffer memories and schedules the 
operation of the parallel functional units, 
i.e., the SPAUs. The SPAUs execute various 
arithmetic processes, communicate only with 
the high speed buffer memories, and are 
subservient to the master MCU. 
ASYNCHRONOUS NETWORK OF SPECIFIC MICROPROCESSORS 


Francois DROMARD and Gérard NOGUEZ 
Institut de Programmation 
Université PARIS VI 


4 place Jussieu 75005 Paris 


Summary 


This paper is a summary of another one [1], [2] 
where the multi-microprocessor architecture 
(MICROPUS) is described. The aim here is to 
clarify the exchange mechanism between the 
microprocessors. 


The studied network is designed on a local 
scale. It is composed of: 

1) microprocessors 

2) paths between them. 

The microprocessors have specific functions 
(and sometimes a specific structure). They have 
their own storage which contains their specific 
data and working area. 

At a logical level, a task consists of a sequence 
of specific sub-tasks. Each sub-task is processed 
by one and only one microprocessor. Only one 
sub-task of a task can be processed at a given time, 
so that, at any time, a task needs just one resource: 
a microprocessor or a path (in order to be transmitted 
to the next microprocessor capable of processing 
the next sub-task). At the same time, it is possible to 
have several tasks in the network. 

The exchanges between microprocessors are asynchronous, 
which implies that the paths are buffered. The 
local scale of the network allows designing a common 
mechanism to manage all the paths. This exchange 
set is composed of a finite number of: 

1) containers 

2) stations. 


The stations can be attached to the microprocessors 
or can be used to collect free containers 
(collectors). A path is the connection between two 
stations. Only one container can stay in a station. 
The others wait on one or more paths. A 
microprocessor can have one or several stations. In 
order to transmit information to another one, a 
microprocessor must use a container staying in one 
of its own stations. If there is none, it can request 
a container from a collector. If there is no 
free container, the microprocessor cannot emit and 
has to wait until a free container is collected. During 
this waiting time, no container is kept by the 
microprocessor. So, there is no deadlock. 


Such an exchange mechanism can be implemented 
in two different ways using: 

1) semaphores and queues 

2) double linked looped lists. 


FRANCE 


The first way is simpler than the "producer-consumer" 
algorithm, because semaphores are associated 
with the stations and not with the paths. Only one 
semaphore is attached to each station. Transmitting 
a container involves decrementing (P operation) the 
emitter station semaphore and incrementing (V operation) 
the receiver station one. The sum of the semaphore 
values is constant and equal to the initial 
number of containers. A collector station also has its 
semaphore, processed like the others. 
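
The semaphore scheme can be illustrated with a small single-threaded model. It is a sketch under stated assumptions, not the MICROPUS implementation: the class and method names are hypothetical, each semaphore value is taken to be the number of containers currently held for that station, and a P operation that would block is reported by returning False instead of suspending a process.

```python
class ExchangeSet:
    """Station semaphores modelled as plain counters."""

    def __init__(self, containers_per_station):
        self.semaphore = dict(containers_per_station)
        self.total = sum(self.semaphore.values())  # initial number of containers

    def transmit(self, emitter, receiver):
        """Move one container; return False where a real P would block."""
        if self.semaphore[emitter] == 0:
            return False
        self.semaphore[emitter] -= 1    # P operation on the emitter station
        self.semaphore[receiver] += 1   # V operation on the receiver station
        # The sum of the semaphore values never changes.
        assert sum(self.semaphore.values()) == self.total
        return True


net = ExchangeSet({"collector": 2, "A": 0, "B": 0})
print(net.transmit("collector", "A"))  # True
print(net.transmit("A", "B"))          # True
print(net.transmit("A", "B"))          # False
print(net.semaphore)                   # {'collector': 1, 'A': 0, 'B': 1}
```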


In the second way, the exchange mechanism is 
made of a doubly linked list memory. A looped list 
is associated with each station. The heading 
stitch of such a list points to: 

1) the stitch attached to the staying 
container (downstream link). 

2) the stitch attached to the last waiting container 
of the station (upstream link). 

Let S be the number of stations and C the number of containers. 
There are S lists and the list memory contains at 
most S+C stitches. 

There are two basic operations: 

1) checking whether a list is empty or not (demand). 

2) transmitting a container from one list to another 
one (supply). 




The demand consists of testing whether the heading downstream 
link points to itself or not. Transmitting a 
container (supply) involves the following linking 
operations: 

1) extracting the first container from the emitter list 
and looping this list on itself. 

2) inserting this container in the receiver list 
(pointed to by the heading downstream link). 

Getting a free container then needs the following 
operations: 

1) the microprocessor issues a demand to the collector. 

2) loops until this list is not empty. 

3) does a supply operation from the collector 
list to its own one. 
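
The demand and supply operations can be sketched with looped doubly linked lists as follows. The class and function names are hypothetical, and one point is an assumption: the container is inserted at the tail of the receiver list, behind the last waiting container, so that each list behaves as a queue.

```python
class Stitch:
    """A list element; a heading stitch initially loops on itself."""

    def __init__(self, label):
        self.label = label
        self.down = self   # downstream link
        self.up = self     # upstream link


def demand(heading):
    """True if the list holds at least one container."""
    return heading.down is not heading


def append(heading, stitch):
    """Link a stitch behind the last waiting container of the list."""
    last = heading.up
    last.down = stitch
    stitch.up = last
    stitch.down = heading
    heading.up = stitch


def supply(emitter, receiver):
    """Move the first container of the emitter list to the receiver list."""
    first = emitter.down
    if first is emitter:
        raise ValueError("emitter list is empty")
    emitter.down = first.down      # extract and close the emitter loop
    first.down.up = emitter
    append(receiver, first)
    return first


def contents(heading):
    """Labels of the containers in a list, staying container first."""
    labels, stitch = [], heading.down
    while stitch is not heading:
        labels.append(stitch.label)
        stitch = stitch.down
    return labels


collector, station = Stitch("collector"), Stitch("station")
append(collector, Stitch("c1"))
append(collector, Stitch("c2"))
if demand(collector):              # a real microprocessor would loop here
    supply(collector, station)
print(contents(collector), contents(station))  # ['c2'] ['c1']
```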
AN APPROACH TO A RESTRICTED SCHEDULING-PROBLEM 
FOR MULTIPROCESSOR SYSTEMS 


Sigram Schindler and Harald Lüdtke 
Fachbereich 20 (Kybernetik) 
Technische Universität Berlin 


Abstract 


The problem of scheduling N tasks - the ope- 
rational precedence structure, «+, of which is re- 
presented as a finite, acyclic, directed, weighted 
graph G - on a multiprocessor system consisting of 
M identical processors is studied. The weight W_I 
of node I, 1 ≤ I ≤ N, we regard as the processing 
time of the task represented by node I, and we 
want all N tasks to be processed completely within 
total processing time CT. We assume that no pre- 
emptions are allowed. Memory for instructions and 
data is assumed to be infinitely large. Processor 
switching time is neglected. 

In this paper some results are derived for 
the case that + can be put together from forests 
and antiforests in a simple way. For the case that 
+ is the disjoint union of a i-tree and a 1i-anti- 
tree, the set of all suitable schedules is gi- 
ven for arbitrary M and CT. 


I. Introduction 

The paper investigates the problem of sche- 
duling M identical processors if the computational 
work to be done is known in advance and if memory 
(or channel) requirements can be neglected. The 
computational work for the processors is described 
by a finite set of 'programs', or 'tasks' which 
have to be executed (or processed) and each of 
which can be assigned for execution to an arbitra- 
ry one of the M processors. Such an assignment can 
last until the task is completely executed or its 
execution can be interrupted because the executing 
processor is needed for another task which has no 
processor. Schedules for the processors that allow 
such interrupts are called preemptive schedules; 
schedules that do not allow interrupts are called 
nonpreemptive schedules. The set of N tasks T_i, 
1 ≤ i ≤ N, their execution times and their operational 
precedence structure, +, are represented 
by a finite, acyclic, weighted, directed graph G 
(abbreviated as FAWD G). The N nodes of this FAWD 
G stand for the given tasks T_i; the weight W_i of 
task T_i, 1 ≤ i ≤ N, is regarded as its processing 
time, sometimes called the length of the task T_i. We 
assume that all tasks in G have positive lengths. 
We can assign a processor to a task (and vice versa) 
iff the task is free, i.e. it has no predecessor. 
As soon as a processor is assigned to a 
task it starts reducing the length of the task, 
i.e. processing the task; this reduction of the 
length of a task takes place with a constant, positive 
and finite speed. If the length of the task 
is reduced to 0, the task together with all its 
outgoing arrows is deleted from the graph. The 
units of length and time are determined such that 
a processor reduces a task by one unit of length 
in one time unit. 
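
The notions of a free task and of deleting a finished task with its outgoing arrows can be illustrated by a minimal Python sketch. The dictionary layout and the names `free_tasks` and `complete` are assumptions made for this illustration only; they do not come from the paper.

```python
from typing import Dict, List, Set


def free_tasks(preds: Dict[str, Set[str]]) -> List[str]:
    """Return the tasks that currently have no predecessor (the free tasks)."""
    return sorted(t for t, p in preds.items() if not p)


def complete(preds: Dict[str, Set[str]], task: str) -> None:
    """Delete a finished task together with all its outgoing arrows."""
    del preds[task]
    for p in preds.values():
        p.discard(task)


# Hypothetical example graph: a -> c, b -> c, c -> d (unit-length tasks).
preds = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
first = free_tasks(preds)      # tasks that may be assigned a processor now
for t in first:
    complete(preds, t)         # both are processed in the first time unit
second = free_tasks(preds)     # tasks that became free afterwards
```

Storing the set of remaining predecessors per task keeps the "free" test a simple emptiness check; other representations would serve equally well.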

For given FAWD G and M processors we are interested 
in CT_min(G;M), i.e. the minimal total processing 
time for G by M processors. Given furthermore an 
upper bound CT ≥ CT_min(G;M) for the total processing 
time of G by the M processors we are interested in 
not only a single schedule but in the class A(G;M;CT) 
of all schedules that meet these conditions. This 
latter interest arises from the attempt to take into 
account further parameters of a computer, like for 
example memory size, channel transfer rate and memory 
control, and not only the number of processors, M. 

The foregoing model is obviously not suit- 
able to describe problems of effective resource 
utilization in today's general purpose computers. 
But it seems reasonable for the investigation of 
the processor allocation problem in a computer 
system of SIMD-type or MIMD-type (see [4]), if a 
few complexes of programs have to be processed 
very often by this system. The importance of the 
processor allocation problem in such systems can 
be derived from [16,17,18], where it is shown that 
processor utilization tends to be lower than 30 % 
if scheduling considerations are omitted. 

At the moment the problem formulated above cannot be 
solved effectively in full generality. Moreover it 
was shown recently in [19] and [20] that probably no 
algorithm exists at all for computing an element from 
A(G;M;CT_min(G;M)) for which the number of steps is 
bounded by a polynomial in N. This result, that the 
problem of determining time-optimal schedules (i.e. 
elements from A(G;M;CT_min(G;M))) probably cannot be 
solved effectively in the general case, even holds if 
certain restrictions ([20]) are imposed on the 
problem. On the other hand there exists a long list 
of results saying that - due to other restrictions - 
the problem of determining time-optimal schedules is 
polynomially solvable in the cases investigated 
([1,2,3,6,7,8,10,11,21,22,23]). 

In order to make clear how the results of this paper 
are related to previous work we now discuss this 
point in some more detail. Restrictions can be put 
upon the general problem by 
- specializing the number of processors 
  (e.g. to M=2), 
- specializing the weights 
  (e.g. to W_i=1, 1 ≤ i ≤ N), 
- specializing the precedence relation, ≺, of G 
  (e.g. to G being a tree), 
- forbidding preemptions. 

Polynomial bounded algorithms to determine 
CT_min(G;M) and an element from A(G;M;CT_min(G;M)) 
are derived for the cases 

- M=2, W_i=1, 1 ≤ i ≤ N, preemptions forbidden, 
  arbitrary ≺ in [6-8]; 

- M=2, arbitrary W_i, preemptions allowed, 
  arbitrary ≺ in [24] and [21]; furthermore 
  A(G;2;CT) for arbitrary CT is given in [21]; 

- arbitrary M, W_i=1, 1 ≤ i ≤ N, preemptions 
  forbidden, G is a forest (a) in [1], and in [23] 
  if G is an anti-forest (a); 

- arbitrary M, arbitrary W_i, preemptions allowed, 
  G is a forest in [2] and [11]; furthermore 
  A(G;M;CT) for arbitrary CT is given in [10]; 

- arbitrary M, arbitrary W_i, preemptions allowed, 
  G is an anti-forest in [23], where A(G;M;CT) for 
  arbitrary CT is given, too. 


The importance of these restrictions can be seen from 
[20], where the cases 
- M=2, arbitrary W_i, preemptions forbidden, 
  ≺ empty; 
- M=2, W_i=1 or 2, 1 ≤ i ≤ N, preemptions 
  forbidden, ≺ arbitrary; 
- M arbitrary, W_i=1, 1 ≤ i ≤ N, preemptions 
  forbidden, ≺ arbitrary; 
- M arbitrary, W_i arbitrary, preemptions allowed, 
  ≺ arbitrary 
are investigated. There it is shown that - even with 
these restrictions - the problems listed are 
'polynomial complete' (see [19]), i.e. essentially 
that a polynomial bounded algorithm to determine a 
time-optimal schedule for such a problem would 
provide us with quite many polynomial bounded 
algorithms to solve well known problems for which 
polynomial bounded algorithms are not known today. 

From this short survey we see especially that 

- the nonpreemptive time-optimal scheduling problem 
  for a general FAWD G with W_i=1, 1 ≤ i ≤ N, and 
  arbitrary M is polynomial complete, but 
- the same problem is polynomial bounded if either 
  M=2 or G is a forest or anti-forest. 


This paper shows that for arbitrary M the restriction 
to forests or anti-forests is not necessary to get 
polynomial bounded scheduling algorithms and derives 
such algorithms for other FAWD's: For an arbitrary 
elementary FAWD (b) G, with W_i=1, 1 ≤ i ≤ N, 
arbitrary integers M > 0 and CT > 0 the set of all 
nonpreemptive schedules for M processors to process G 
completely within time CT, A^n(G;M;CT), is described 
by a scheduling scheme, the algorithms of which are 
polynomial bounded in N. So the attempt to keep track 
of the increase of complexity when generalizing the 
scheduling problem is made with arbitrary M, allowing 
the precedence relation to become somewhat more 
complex than that of a forest or anti-forest, and not 
the other way around, with arbitrary FAWD G, allowing 
M to become larger than 2. 

(a) A FAWD G is called a forest (anti-forest) iff 
    each node in G has at most one immediate 
    successor (predecessor). If a forest (or 
    anti-forest) G is connected, it is called a tree 
    (or anti-tree). 

(b) A tree (anti-tree) is called 1-tree (1-anti-tree) 
    iff all nodes with indegree (outdegree) ≥ 2 are 
    located on one path (respectively) and for each 
    edge of G its target (source) node lies on this 
    path. A FAWD G is called an elementary FAWD iff G 
    is the disjoint union of a 1-tree and a 
    1-anti-tree (see figure 1). 
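
The forest and anti-forest conditions of footnote (a) are simple degree conditions, as the following minimal Python sketch shows. It assumes that the graph is given as a list of immediate-precedence edges `(source, target)` and that it is already known to be acyclic; the function names are illustrative and not taken from the paper.

```python
from collections import Counter
from typing import List, Tuple

Edge = Tuple[int, int]  # (source, target): source must finish before target


def is_forest(edges: List[Edge]) -> bool:
    """Forest: every node has at most one immediate successor."""
    outdegree = Counter(source for source, _ in edges)
    return all(count <= 1 for count in outdegree.values())


def is_anti_forest(edges: List[Edge]) -> bool:
    """Anti-forest: every node has at most one immediate predecessor."""
    indegree = Counter(target for _, target in edges)
    return all(count <= 1 for count in indegree.values())


# Hypothetical examples: tasks 1 and 2 feed task 3, which feeds task 4 ...
in_tree = [(1, 3), (2, 3), (3, 4)]
# ... and task 1 fans out to tasks 2 and 3.
out_tree = [(1, 2), (1, 3)]
```

A simple chain satisfies both conditions, which is why the two classes overlap.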

For the preemptive case and arbitrary W_i, 
1 ≤ i ≤ N, the authors will submit further and 
more general results in [26]. 


II. Results 

Let G be a basic FAWD and let W_i=1, 1 ≤ i ≤ N. For 
the case that preemptions are not allowed an 
algorithm bounded by N^(M+3) is derived first that 
determines CT_min(G;M). If G is an elementary FAWD 
moreover the set A^n(G;M;CT) of all nonpreemptive 
schedules will be described for an arbitrary given 
CT ≥ CT_min(G;M). The problem of determining 
M_min(G;CT) for arbitrary CT > 0 such that 
A^n(G;M_min(G;CT);CT) ≠ ∅ will be investigated 
elsewhere. For ease of presentation we introduce the 
following notions. 

For a basic FAWD G we define G^- to be a maximal 
anti-forest of G and G^+ to be the subgraph of G 
consisting of all nodes of G not contained in G^- and 
all edges between these nodes in G (see figure 1). 
Obviously G^+ is a forest. Note that in general G^- 
is not uniquely defined and that the algorithm for 
constructing it is polynomial bounded (see [25]). We 
do not represent a graph G in the usual way (see 
figure 1) but use the self-explanatory representation 
of G in figure 2. Such a representation of G is 
called stripe representation R(G). Note that for each 
G there are infinitely many stripe representations 
R(G). For an arbitrary stripe representation R(G) of 
G let b(R;G;CT-t) for 0 ≤ t ≤ CT be the number of 
tasks cut by a height-line through CT-t (see 
figure 2). For an arbitrary G, M and CT, the stripe 
representations of principal interest are those for 
which 

$$\int_0^{CT} b(R;G;CT-t)\,dt = N \quad\text{and}\quad b(R;G;CT-t) \le M, \quad 0 \le t \le CT;$$

these are called (M,CT)-stripe representations of G. 
An (M,CT)-stripe representation R(G) of G is called 
monotonic increasing (decreasing) iff 
b(R;G;CT-t) ≤ b(R;G;CT-t') (b(R;G;CT-t) ≥ 
b(R;G;CT-t'), respectively) in this representation 
R(G) for 0 ≤ t ≤ t' ≤ CT. 
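
For unit-length tasks placed in integer time slots, the quantities above reduce to simple counting. The following Python sketch, written under that assumption, computes the profile b per time slot and tests the (M,CT) and monotonicity conditions; the dictionary `start` (task name to slot index) and all function names are hypothetical.

```python
from typing import Dict, List


def profile(start: Dict[str, int], ct: int) -> List[int]:
    """Number of unit tasks in each time slot [t, t+1), t = 0..ct-1."""
    b = [0] * ct
    for slot in start.values():
        b[slot] += 1
    return b


def is_m_ct(b: List[int], n: int, m: int) -> bool:
    """All n tasks are placed and no slot needs more than m processors."""
    return sum(b) == n and all(x <= m for x in b)


def is_increasing(b: List[int]) -> bool:
    """Profile never decreases as time t grows."""
    return all(x <= y for x, y in zip(b, b[1:]))


def is_decreasing(b: List[int]) -> bool:
    """Profile never increases as time t grows."""
    return all(x >= y for x, y in zip(b, b[1:]))


# Hypothetical placement of four unit tasks in CT = 3 slots.
start = {"a": 0, "b": 0, "c": 1, "d": 2}
b = profile(start, 3)
```

The sketch checks only the counting conditions; it does not verify that the placement respects the precedence rules of G.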

For arbitrary integers CT > 0 and M > 0 and an 
arbitrary subgraph G' of G let pd(G';CT-t) be a 
mapping {[i-1,i); i=1,...,CT} → {0,1,2,...,M}, 
called the processor distribution for G'. For 
0 ≤ t ≤ CT the value of pd(G';CT-t) gives us the 
number of processors available for processing of G' 
at time t. We sometimes use the shorter notation pd 
when no confusion is possible. For arbitrary pd let 
A^n(G';pd;CT) denote the set of all schedules for 
complete processing of G' with pd processors in CT 
time units. An (M,CT)-stripe representation R(G') of 
G' such that b(R;G';CT-t) ≤ pd(G';CT-t), 0 ≤ t ≤ CT, 
is called a (pd,CT)-stripe representation of G'. 


Lemma 1: Let G be a FAWD, let CT > 0 and let 
    pd := pd(G;CT-t). Then 
    A^n(G;pd;CT) ≠ ∅ ⇔ ∃ (pd,CT)-stripe 
    representation R(G) of G. 


Proof: 

    ⇐: The level-by-level schedule, using 
    pd(G;CT-t) processors at time t and applied to G 
    in representation R(G), is in A^n(G;pd;CT). 

    ⇒: Let S ∈ A^n(G;pd;CT). Processing G according 
    to S defines a (pd,CT)-stripe representation of 
    G. q.e.d. 


Theorem 1: Let G be an arbitrary basic FAWD with 
    weights W_i=1, 1 ≤ i ≤ N, and G^- and G^+ 
    derived from G as defined above. Let CT and M be 
    arbitrary integers such that A^n(G;M;CT) ≠ ∅. 

    Then there exists a pd^- := pd(G^-;CT-t), 
    monotonic increasing, and a 
    pd^+ := pd(G^+;CT-t), monotonic decreasing, such 
    that 

    pd^-(G^-;CT-t) + pd^+(G^+;CT-t) ≤ M, 0 ≤ t ≤ CT, 
    A^n(G^-;pd^-;CT) ≠ ∅ and A^n(G^+;pd^+;CT) ≠ ∅. 


Proof: From A^n(G;M;CT) ≠ ∅ and Lemma 1 we get an 
    (M,CT)-stripe representation R(G) of G and 
    therefore the stripe representations R(G^-) and 
    R(G^+), too. If these R(G^-) and R(G^+) are not 
    monotonic increasing and decreasing, 
    respectively, then we change them - without 
    violating precedence rules in G - such that the 
    resulting stripe representations of G^- and G^+ 
    have this property. The way this exchange is 
    done can easily be seen from figure 3 and is 
    described now. In this case there exists an 
    integer t_0, 1 ≤ t_0 ≤ CT, such that at least 
    one of the two equalities 
    b(R;G^+;CT-t_0) = b(R;G^+;CT-t_0+1) + K_1 and 
    K_2 + b(R;G^-;CT-t_0) = b(R;G^-;CT-t_0+1) holds 
    for some K_1, K_2 > 0. We show how to proceed in 
    the case that both equalities hold; the case 
    that only one of them holds is treated by 
    applying only a part of the procedure described 
    subsequently. We first investigate the case 
    K_1 = K_2 = 1. 

    As b(R;G^+;CT-t_0) = b(R;G^+;CT-t_0+1) + 1 and 
    G^+ is a forest there is at least one task T in 
    G^+ starting on height-line CT-t_0, in the 
    present representation R(G^+), that has no 
    predecessor ending on height-line CT-t_0. 
    Therefore it could be shifted up by one, leading 
    to a new representation R'(G^+), if by this 
    action no precedence constraint of the original 
    graph G were violated. Let us assume that we 
    violated such a precedence constraint of G. Then 
    there is a task T' in G^-, which is a 
    predecessor of T in G. Then, because G^- is a 
    maximal anti-forest, the task T belongs to G^-, 
    which is a contradiction. Therefore T can be 
    shifted up by one to start on height-line 
    CT-t_0+1. The analogous argument allows us to 
    shift one of the tasks of G^- ending on 
    height-line CT-t_0 down by one to start on 
    CT-t_0. This is true because G is a basic FAWD. 

    If K_1 and/or K_2 are greater than one and/or 
    there exist several t_0 for which the above 
    equalities hold, finite repetition of this 
    procedure eventually provides us with a 
    monotonic decreasing (M,CT)-stripe 
    representation R(G^+) of G^+ and a monotonic 
    increasing one R(G^-) of G^-. Obviously then G 
    is brought to another (M,CT)-stripe 
    representation, immediately defined by R(G^-) 
    and R(G^+). 

    We now define pd^+ := b(R;G^+;CT-t) and 
    pd^- := b(R;G^-;CT-t), 0 ≤ t ≤ CT. 

    Then obviously pd^+ and pd^- are monotonic 
    decreasing and increasing, respectively, and 
    pd^+ + pd^- ≤ M. 

    Applying the level-by-level schedule ([11]) with 
    pd^- and pd^+ processors to R(G^-) and R(G^+), 
    respectively, shows that 

    A^n(G^-;pd^-;CT) ≠ ∅ and A^n(G^+;pd^+;CT) ≠ ∅. 
    q.e.d. 


Figure 1a: An example of a basic FAWD G represented 
in the usual way. 

Figure 1b: A decomposition of G into a maximal 
anti-forest G^- and a forest G^+. G^+ is the subgraph 
of G consisting of all nodes of G not contained in 
G^- and all edges between these nodes in G. The edges 
deleted from G are drawn by dashed lines. 

Figure 1c: An example of a 1-tree and a 1-anti-tree, 
respectively. 


Corollary 1: Let the assumptions of Theorem 1 be 
    true. Then pd^- and pd^+ from Theorem 1 can be 
    chosen such that at least one of the 
    inequalities pd^-(G^-;CT) · pd^+(G^+;CT) ≠ 0 and 
    pd^-(G^-;0) · pd^+(G^+;0) ≠ 0 holds. 


Proof: Let R(G) be the (M,CT)-stripe representation 
    of G constructed for the proof of Theorem 1. Let 
    b(R;G^-;CT) · b(R;G^+;CT) = 0 and 
    b(R;G^-;0) · b(R;G^+;0) = 0 (otherwise no 
    further proof is needed). 

    Apply the exchanging procedure described above 
    to a highest task T in R(G^+) and lowest task T' 
    in R(G^-) such that T is moved up and T' is 
    moved down and such that the resulting 
    (M,CT)-stripe representations of G^+ and G^- are 
    monotonic again. Repeat this step as long as 
    necessary until the assertion becomes true (see 
    [25]). q.e.d. 


It seems to be not difficult to generalize Theorem 1 
to an arbitrary FAWD G whose underlying undirected 
graph is acyclic. In this case the problem arises to 
determine an appropriate decomposition of G into a 
maximal anti-forest G^- and its associated forest 
G^+; this latter problem disappears if G is assumed 
to be a basic FAWD. But difficulties arise if one 
attempts to extend the above notions of G^- and G^+ 
such that Theorem 1 holds for the case that G's 
underlying undirected graph contains cycles (see 
example in figure 4). 


Given an elementary FAWD G and an arbitrary pd we 
often will make use of the so called 'highest task 
first'-schedule, S_HTF(G), for selecting free tasks 
for assignment to the pd processors while processing 
G. This schedule was investigated first for G being a 
tree in [1] and for G being an anti-forest in [23]; 
in both cases pd = M was assumed. 
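
A 'highest task first' rule for unit tasks on a forest can be sketched as follows. This is an illustration under stated assumptions rather than the authors' formulation: the height of a task is taken to be the number of tasks on the path from it to its root, ties are broken by task name, and `pd` is a list giving the processors available in each time slot. The name `htf_schedule` is hypothetical.

```python
from typing import Dict, List, Optional


def htf_schedule(succ: Dict[str, Optional[str]],
                 pd: List[int]) -> Optional[List[List[str]]]:
    """Highest-task-first for unit tasks where each task has <= 1 successor.

    Returns the tasks processed in each slot, or None if some task is left
    unprocessed after len(pd) slots.
    """
    height: Dict[str, int] = {}

    def h(v: str) -> int:
        # Height = number of tasks on the path from v to its root.
        if v not in height:
            nxt = succ[v]
            height[v] = 1 + (h(nxt) if nxt is not None else 0)
        return height[v]

    for v in succ:
        h(v)

    npred = {v: 0 for v in succ}          # remaining predecessors per task
    for s in succ.values():
        if s is not None:
            npred[s] += 1

    remaining = set(succ)
    slots: List[List[str]] = []
    for procs in pd:
        free = sorted((v for v in remaining if npred[v] == 0),
                      key=lambda v: (-height[v], v))
        chosen = free[:procs]             # highest free tasks first
        for v in chosen:
            remaining.discard(v)
            nxt = succ[v]
            if nxt is not None:
                npred[nxt] -= 1
        slots.append(chosen)
    return slots if not remaining else None


# Hypothetical in-tree: a, b -> d; c, d -> e.
succ = {"a": "d", "b": "d", "c": "e", "d": "e", "e": None}
```

With `pd = [2, 2, 1]` the sketch processes the tree completely; with one processor per slot for three slots it reports failure, since five tasks cannot fit.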
Figure 2a: The stripe representation R(G) of G. The 
lines represent the nodes, the weights of which 
determine the lengths of the lines (in this case 
W_i=1, 1 ≤ i ≤ N). The precedence rules in G are 
shown by the dashed lines; in this example 
b(R,G,CT-3.5) = 2 and b(R,G,CT-6) = 1. 

Figure 2b: The special stripe representation R^u(G) 
of G. The lines are placed as low as possible; in 
this stripe representation b(R^u,G,CT-6) = 4 and 
b(R^u,G,CT-8) = 2. 
Lemma 2: Let G be an arbitrary 1-tree or 
    1-anti-tree. Let CT > 0 and pd be an arbitrary 
    processor distribution for G. Then the 
    implication 
    S_HTF(G) ∉ A^n(G;pd;CT) ⇒ A^n(G;pd;CT) = ∅ 
    holds. 


The proof of this Lemma is an elementary modification 
of arguments used in [10,11,22,23], and is therefore 
omitted here. 


Theorem 2: Let G be an arbitrary elementary FAWD and 
    M > 0. Then CT_min(G;M) can be computed by 
    testing (at most) N^M different 
    processor-distributions, i.e.: CT_min(G;M) can 
    be computed by an algorithm the number of steps 
    of which is bounded by const · M · N^(M+3). 


Proof: Due to Theorem 1 we may restrict ourselves to 
    monotonic processor-distributions, the total 
    number of which is N^M. 
    For a given processor distribution pd^+ for G^+ 
    we define pd^- := M - pd^+. By applying the 
    S_HTF-schedule to G^+ and G^- (with pd^+ and 
    pd^- processors, respectively) and using Lemma 2 
    it can be decided in at most const · M · N^2 
    steps, whether A^n(G^+;pd^+;CT) ≠ ∅ and 
    A^n(G^-;pd^-;CT) ≠ ∅. In order to find the 
    smallest such CT at most N repetitions of the 
    whole procedure are required. q.e.d. 


Figure 3: Situation before exchanging (height-lines 
CT-t_0+1, CT-t_0 and CT-t_0-1). 
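
The restriction to monotonic processor distributions used in the proof of Theorem 2 can be made concrete with a small enumeration sketch. It assumes integer time slots 0..CT-1 and generates every monotonic decreasing distribution with values in 0..M; the complementary distribution M - pd is then automatically monotonic increasing. The function name is hypothetical, and the count produced by this sketch is a property of the sketch itself, not the cruder bound stated in the theorem.

```python
from itertools import combinations_with_replacement
from math import comb
from typing import Iterator, List


def monotone_decreasing_pds(m: int, ct: int) -> Iterator[List[int]]:
    """Yield all non-increasing sequences of length ct with values in 0..m."""
    # Drawing from m, m-1, ..., 0 in that order yields non-increasing tuples.
    for values in combinations_with_replacement(range(m, -1, -1), ct):
        yield list(values)


pds = list(monotone_decreasing_pds(2, 3))
complements = [[2 - p for p in pd] for pd in pds]
```

For fixed M the number of candidates grows only polynomially in CT, which is what makes exhaustive testing of the candidates affordable.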


Remarks: 


1) Note that without Theorem 1 it would have been 
necessary to test (2N)^M different 
processor-distributions instead of N^M. 


2) The restriction of G to be an elementary FAWD is 
sufficient but not necessary for validity of 
Theorem 2. The restriction allows us to use the 
simple S_HTF for deciding the question, whether for 
an arbitrary given 1-tree or 1-anti-tree G', CT > 0 
and pd the set A^n(G';pd;CT) is nonempty. If we omit 
this restriction totally, no effective algorithm is 
known at present to decide the same question for the 
resulting more general case. The authors will give an 
investigation of this problem elsewhere and hope to 
be able to derive polynomial bounded algorithms to 
solve the more general problem. 


3) Obviously we used extremely crude bounds. These 
bounds can substantially be improved by taking into 
account the structure of the graph investigated (see 
[25]). 


Figure 4 


Let G be an elementary FAWD, CT an integer, 
pd(G;CT-t) a processor distribution for G and CT. 
Then the triple (G;pd;CT), as well as all its 
components, are called admissible iff 
A^n(G;pd;CT) ≠ ∅. For given G and CT let PD(G;CT) 
denote the set of all admissible pd's. Let G^+ and 
G^- be G's underlying 1-tree and 1-anti-tree, 
respectively; let pd^+ ∈ PD(G^+;CT-t) and 
pd^- ∈ PD(G^-;CT-t) and let pd(G;CT-t) denote the 
processor distribution for G with 
pd(G;CT-t) := pd^+(G^+;CT-t) + pd^-(G^-;CT-t), 
0 ≤ t ≤ CT, which only allows pd^+ processors for G^+ 
and pd^- processors for G^- at any time t, 
0 ≤ t ≤ CT. Let (G;pd;CT) be an arbitrary admissible 
triple; then an assignment X of at most pd(G;CT) 
processors to free tasks of G is called admissible 
iff the triple (G\T(X);pd;CT-1) is admissible, where 
T(X) denotes the set of tasks from G assigned by X; 
let 𝒳(G;pd;CT) denote the set of all admissible 
assignments X for (G;pd;CT). 


Figure 5: An example for the case M = 3. 


Theorem 3: Let G be an elementary FAWD and let G^+ 
    and G^- be its underlying 1-tree and 
    1-anti-tree; let CT > 0. Let pd^+(G^+;CT-t) be 
    an arbitrary processor distribution for G^+. Let 
    G^+ and G^- be processed with pd^+ and M-pd^+ 
    processors, respectively, both according to 
    S_HTF. 

    Then pd^+ is not admissible if G is not 
    processed completely after time CT. 


    The proof of Theorem 3 is an elementary 
    application of Lemma 2. Remember that the 
    complexity of the S_HTF scheduling algorithm is 
    bounded by const · M · N^2. Note also that S_HTF 
    for M processors applied to an elementary FAWD G 
    (omitting the processor distribution 
    prescription) need not imply complete processing 
    of G in CT time units (see figure 5). 


We will now explain the form of the solution to the 
problem of describing A^n(G;M;CT) that one would like 
to get and the one we are able to derive at present. 

For an arbitrary given admissible triple (G;M;CT), 
where G is an elementary FAWD, we give a scheduling 
scheme σ̃ from which all schedules from A^n(G;M;CT) 
could be derived (by appropriate interpretation of 
this scheme) provided that we can find suitable 
algorithms x and z. 


SCHEMA σ̃ 


Input: Elementary FAWD G, 
       admissible integers M > 0 and CT > 0 


- Apply the algorithm z to the admissible triple 
  (G;M;CT-t) in order to compute the sets 
  PD^+(G^+;M;CT-t) ⊆ PD(G^+;M;CT-t) and 
  PD^-(G^-;M;CT-t) ⊆ PD(G^-;M;CT-t) such that 
  (M-pd^+) ∈ PD^-(G^-;M;CT-t) for each 
  pd^+ ∈ PD^+(G^+;M;CT-t). 

- Choose an arbitrary pd^+ ∈ PD^+(G^+;M;CT-t). 

- Apply the algorithm x to the admissible triples 
  (G^+;pd^+;CT-t) and (G^-;M-pd^+;CT-t) in order to 
  compute the sets 𝒳^+ := 𝒳(G^+;pd^+;CT-t) and 
  𝒳^- := 𝒳(G^-;M-pd^+;CT-t). 

- Choose an arbitrary assignment X^+ ∈ 𝒳^+ and 
  X^- ∈ 𝒳^-. 

- S := S ∪ {(t, T(X^+) ∪ T(X^-))}, delete 
  T(X^+) ∪ T(X^-) from G, 
  t := t+1. 

- t = CT? If not, repeat the steps above; if YES: 

Output: Sequencing-list S 

We first note that scheme σ̃ becomes a scheduling 
algorithm as soon as it is interpreted, i.e. a rule 
is added, how to choose pd^+ ∈ PD^+, X^+ ∈ 𝒳^+ and 
X^- ∈ 𝒳^-. The algorithms z and x are not affected 
by this specification. Second we see that scheme σ̃ 
would provide us with the most general solution to 
the scheduling problem in this case. At the beginning 
of each time interval 0,1,2,...,CT-1, the algorithm z 
first shows us what possibilities exist for the 
choice of admissible pd's. After having chosen a 
suitable pd^+, algorithm x tells us what 
possibilities exist for the choice of X that are 
compatible with the already fixed pd. 

As the investigations concerned with an algorithm z 
are quite elaborate ([25]), another scheme σ for 
describing the set A^n(G;M;CT) for an arbitrary 
admissible (G;M;CT) is presented. Compared to the 
above scheme σ̃ this new scheme σ will not contain 
the algorithm z but an algorithm scheme τ which 
describes the set PD^+(G^+;M;CT-t) and therefore 
PD^-(G^-;M;CT-t), too. We give this scheme τ first. 
As we are interested mainly in polynomial boundedness 
we can afford to construct a simple τ. 


SCHEMA τ 


Input: Elementary FAWD G, admissible M and CT 


- Apply the algorithm y to the admissible triple 
  (G;M;CT-t) in order to compute 
  Q(t) := {q(t) | 0 ≤ q(t) ≤ M such that 
  pd^+(G^+;CT-t') := q(t') and 
  pd^-(G^-;CT-t') := M - q(t'), 0 ≤ t' ≤ t, define 
  a 'prefix' of length t of an admissible pd 
  for G and CT}. 

- Choose an arbitrary q ∈ Q(t) 
  and delete Q(t)\{q}; let q(t) := q. 

- Delete q(t) highest free tasks from G^+ and 
  M - q(t) highest free tasks from G^- (if only 
  q' < q(t) and/or q" < M - q(t) free tasks are 
  available in G^+ and G^-, respectively, delete 
  these q' and/or q" tasks). 

- t := t+1. 

- t = CT? If not, repeat the steps above; if YES: 

Output: pd := {(q(t), M - q(t)) | 0 ≤ t ≤ CT-1} 


Note that for algorithm y we can use a simple 
modification of an algorithm y' to compute 
CT_min(G;M) according to Theorem 2; then y is bounded 
by const · M^2 · N^(M+3). 


Theorem 4: Let (G;M;CT) be an admissible triple, 
    where G is an elementary FAWD. Then the 
    algorithm scheme τ describes the set of all 
    admissible pd's for the triple (G;M;CT). 

Again we do not give a formal proof but note that 
- each interpretation of τ leads to an admissible 
  pd for (G;M;CT), 
- each admissible pd for (G;M;CT) can be obtained 
  from τ by an appropriate interpretation (defined 
  by this pd). 


For the rest of the paper we are mainly concerned 
with the definition and investigation of the 
algorithm x from σ. We will define x only for the 
case that G is a 1-tree G^+; the case that G is a 
1-anti-tree can be treated similarly. Let us remember 
that x is to be a polynomial bounded algorithm the 
application of which to an admissible triple 
(G^+;pd;CT) provides us with the set 𝒳(G^+;pd;CT) of 
all admissible assignments. This set 𝒳 will 
essentially be defined by a 'lowest assignment 
function X_0' which has the property that an 
arbitrary X belongs to 𝒳 iff X's 'associated' 
assignment function X is 'higher' than X_0. 
Explicitly this means: X_0 is a total function from 
{1,2,...,H(G)} → {0,1,...,k} such that 

$$\sum_{i=1}^{H(G)} X_0(i) = k,$$

where k is defined by x; for an arbitrary assignment 
X its associated assignment function 
X: {1,2,...,H(G)} → {0,1,...,k} is total and defined 
by X(i) := number of processors assigned to tasks of 
G^+ starting on height-line i, 1 ≤ i ≤ H(G). X is 
higher than X_0 iff 

$$\sum_{i=0}^{j} X(H(G)-i) \ge \sum_{i=0}^{j} X_0(H(G)-i) \quad\text{for all}\quad j = 0,1,\ldots,H(G)-1.$$
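
The 'higher than' relation is a comparison of partial sums taken from the topmost height-line downwards, as the following Python sketch illustrates. It assumes that an assignment function is stored as a list whose position i-1 holds the value for height-line i; the name `is_higher` is hypothetical.

```python
from typing import List


def is_higher(x: List[int], x0: List[int]) -> bool:
    """True iff every top-down partial sum of x is >= that of x0.

    x[i-1] and x0[i-1] hold the number of processors assigned to tasks
    starting on height-line i, for i = 1..H(G).
    """
    if len(x) != len(x0):
        raise ValueError("both functions must cover the same height-lines")
    sum_x = sum_x0 = 0
    for a, b in zip(reversed(x), reversed(x0)):
        sum_x += a
        sum_x0 += b
        if sum_x < sum_x0:
            return False
    return True
```

For example, with three height-lines and k = 3, the function [0, 1, 2] is higher than [1, 1, 1], whereas [2, 1, 0] is not.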


Definition of algorithm x 


Let an arbitrary admissible triple (G^+;pd;CT) be 
given, where G^+ is a 1-tree in R^u(G^+) 
representation. 
1) Determine the maximal number k' of processors that 
might be left idling by an admissible assignment 
X ∈ 𝒳(G^+;pd;CT). (This k' can be computed by 
applying the S_HTF for 1-anti-tree to the 'inversed' 
of G^+, see [23]). Let k := pd(G^+;CT) - k'. 
If k=0 then Stop (because an arbitrary assignment of 
k" processors, 0 ≤ k" ≤ pd(G^+;CT), to arbitrary 
tasks of G^+ is in 𝒳). In this case therefore 
algorithm x ends here. 

We now determine X_0 for k processors. Let initially 
X_0(i) := 0, 1 ≤ i ≤ H(G). Let i := 1, G_1^+ := G^+ 
and h(i) := 1. 
2) If i > k then Stop; i.e. in the case k > 0 
   algorithm x ends here. 
   else 
   begin 
   L: if check(G_i^+;CT-1;h(i)) then 
        begin X_0(h(i)) := X_0(h(i)) + 1; 
              i := i+1; 
              delete one task from G_(i-1)^+ on 
              height-line h(i-1) and call the 
              result G_i^+; 
              goto 2; 
        end 
      else 
        begin h(i) := h(i) + 1; 
              goto L; 
        end; 
   end; 

The boolean procedure check(G_i^+;CT-1;h(i)) returns 
the value true iff 


a) in R^u(G_i^+) there is a free task T on 
height-line h(i) and 


b) deletion of task T and the highest k-i free tasks 
from G_i^+ results in a graph G^+' such that 
(G^+';pd;CT-1) is an admissible triple. 


End of description of algorithm x. 
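
The control flow of step 2 can be sketched in Python with the procedure check passed in as a callable. Everything here is an assumption made for illustration: the signature `check(i, h, x0)`, the fact that the current height-line h is carried over from one processor to the next (the text only says that h(i) increases monotonically), and the toy check used in the example, which tests only condition a) against a table of free tasks per height-line.

```python
from typing import Callable, Dict, List


def lowest_assignment(k: int, height: int,
                      check: Callable[[int, int, List[int]], bool]) -> List[int]:
    """Greedy construction of X_0 for k processors over height-lines 1..height.

    check(i, h, x0) stands in for check(G_i^+; CT-1; h(i)); x0 is the
    partially built function (index 0 unused).
    """
    x0 = [0] * (height + 1)
    h = 1
    for i in range(1, k + 1):
        while not check(i, h, x0):
            h += 1                      # try the next height-line
            if h > height:
                raise ValueError("no admissible height-line found")
        x0[h] += 1                      # processor i goes to height-line h
    return x0[1:]


# Toy stand-in for check: a free task must still be available on line h.
free_per_line: Dict[int, int] = {1: 1, 2: 0, 3: 2}


def toy_check(i: int, h: int, x0: List[int]) -> bool:
    return free_per_line[h] - x0[h] > 0


result = lowest_assignment(2, 3, toy_check)
```

A real check would also have to test condition b), i.e. the admissibility of the remaining triple, for instance by running a highest-task-first schedule on the reduced graph.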


Theorem 5: Let G^+ be a 1-tree, let the triple 
    (G^+;pd;CT) be admissible and let the algorithm 
    x, and the result k and X_0 of its application 
    to (G^+;pd;CT) be as defined above. Finally let 
    X be an arbitrary assignment of k processors to 
    free tasks from G^+ and X its associated 
    assignment function. Then the following holds: 
    X ∈ 𝒳(G^+;pd;CT) ⇔ X is higher than X_0. 


Remark: If |T(X)| > k then only k of the tasks 
assigned by X are subject to the above constraint. 


The proof of Theorem 5 is elaborate and voluminous; 
therefore only its main ideas are characterized by 
listing the Lemmas involved. A complete proof is 
given in [25]. 


Lemma 3: Let (G^+;pd;CT) be admissible, let 
    X ∈ 𝒳(G^+;pd;CT) and let X' be an arbitrary 
    assignment with |T(X')| ≥ |T(X)|. 

    The following implication holds: 
    X' is higher than X ⇒ X' ∈ 𝒳(G^+;pd;CT). 


As X_0 defined by algorithm x is admissible by 
construction, Lemma 3 assures that all assignments 
'lying above' X_0 are admissible, too; i.e. Lemma 3 
proves ⇐ from Theorem 5. 


Lemma 4: Let the assumptions of Theorem 5 be true 
    and let X ∈ 𝒳(G^+;pd;CT). Then 
    a) if X_0(i) = 0, 0 ≤ i ≤ i' ⇒ X(i) = 0, 
       0 ≤ i ≤ i'. 
    b) if X(i) = X_0(i), 0 ≤ i ≤ i' 
       ⇒ X(i'+1) ≥ X_0(i'+1). 


By Lemma 4 the monotonic increase of h(i) in 
algorithm x is justified. 


Lemma 5: Let (G^+;pd;CT) be an admissible triple. 
    Let X be an arbitrary assignment such that 
    there exist i', i", 0 ≤ i' < i'+2 ≤ i" ≤ H(G), 
    for which X(i) < X_0(i), 0 ≤ i ≤ i' and 
    i" ≤ i ≤ H(G), and X(i) > X_0(i), i' < i < i". 
    Then X ∉ 𝒳(G^+;pd;CT). 


By the last Lemma the uniqueness of X_0 is 
established. The proof of implication ⇒ from 
Theorem 5 follows essentially from Lemma 4 and 
Lemma 5. Note finally that the complexity of the 
algorithm x is bounded by 
const · (M+N) · M^2 · N^(M+3). 

We summarize the results of this paper in 


Theorem 6: Let (G;M;CT) be an admissible triple, 
    where G is an elementary FAWD. Then scheme σ, 
    defined by the diagram 

    Input: Elementary FAWD G 
           and admissible integers M and CT 

    - Invoke scheme τ in order to obtain an 
      admissible pd ∈ PD(G;M;CT-t). 
    - Apply the algorithm x to (G^+;pd^+;CT-t) 
      and (G^-;pd^-;CT-t) in order to obtain 
      the sets 𝒳^+ and 𝒳^-. 
    - Choose an arbitrary X^+ ∈ 𝒳^+ and X^- ∈ 𝒳^-. 
    - S := S ∪ {(t, T(X^+) ∪ T(X^-))}, 
      delete T(X^+) ∪ T(X^-) from G. 
    - t := t+1 
    - t = CT? If not, repeat the steps above; 
      if YES: 

    Output: Sequencing list S 

    stop 

    characterizes the set A^n(G;M;CT). All 
    algorithms involved are polynomial bounded. 
A SCHEDULING MODEL FOR 
COMPUTER SYSTEMS WITH TWO CLASSES OF PROCESSORS 


R. E. Buten 
V. Y. Shen 
Computer Sciences Department 
Purdue University 
Lafayette, Indiana 47907 


Abstract -- This paper describes a simple 
algorithm to schedule a restricted set of jobs on 
a multiprocessor system with two classes of pro- 
cessors. Through deterministic analysis an upper 
bound is established for the behavior of the al- 
gorithm. This bound is seen to compare favorably 
with the upper bound intrinsic to the model. Sim- 
ulation results show the algorithm to be useful in 
scheduling less restricted job sets. 


1. INTRODUCTION 


In recent years computing systems have been 
routinely called upon to support a variety of on- 
line services in addition to carrying an ever in- 
creasing computing load. One major mainframe man- 
ufacturer's response to these divergent needs is 
a multiprocessor system composed of two types of 
processors. 


One class or type of processor is primarily 
designed to perform floating-point arithmetic 
operations very efficiently. Accordingly, absent 
from its instruction set are many functions re- 
quired by general purpose usage. Notable among 
these are character oriented and I/O instructions. 
Let the type and number of this kind of processor 
be designated as A and m_ respectively. 


The other class of processor is equipped with 
a very low level instruction set. Character op- 
erations if not elegant are at least straight- 
forward. Its significant aspect, however, is the 
ability of this class of processor to perform I/0. 
In like manner, let the type and number of this 
kind of processor be designated as B and n, re- 
spectively. Any job run on such a system will 
necessarily require both kinds of resource. 


The resource requests of a job may be represented by 
a weighted, directed, and acyclic graph as shown in 
Figure 1. The graph is called the resource request 
graph of a job. This graph completely specifies the 
resource requests on the two types of processors and 
their precedence relations. As indicated in the 
graph, the two types of requests are made to 
processors A and B, respectively. 


Figure 1 
Resource request graph for a typical job. 

Given a collection of such jobs it is the function 
of the scheduling algorithm to assign tasks (nodes 
in the resource request graph) to available 
processors. A task is the basic unit of allocation, 
i.e. once begun on a processor, it executes without 
interruption to completion. This is to say, 
consideration will be restricted to non-preemptive 
scheduling algorithms. 


The performance of a scheduling algorithm may 
be measured in several ways. Some of these are: 
mean throughput, average response time, and dead- 
line compliance. The measure used in this paper, 
however, shall be the amount of time needed to 
complete the entire set of jobs. An optimal sched- 
ule, therefore, is one in which the entire set of 
jobs is completed in the minimal time. 
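
The performance measure used here, the time needed to complete the entire job set, is easy to state in code. The sketch below assumes a schedule is given as a list of tuples `(task, processor, start, duration)`; this layout and the name `completion_time` are illustrative only.

```python
from typing import List, Tuple

Entry = Tuple[str, str, float, float]  # (task, processor, start, duration)


def completion_time(schedule: List[Entry]) -> float:
    """Time at which the last task of the schedule finishes."""
    return max((start + duration for _, _, start, duration in schedule),
               default=0.0)


# Hypothetical schedule on one A-processor and one B-processor.
schedule = [
    ("a1", "A1", 0, 3),
    ("a2", "A1", 3, 1),
    ("b1", "B1", 3, 2),
]
total = completion_time(schedule)
```

An optimal schedule in the sense of the text is one that minimizes this value over all feasible schedules.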


It is generally acknowledged, based on anal- 
ysis of similar models, that the generation of 
optimal schedules for such a general problem re- 
quires an exponential number of steps. If a prob- 
lem of this nature were to have very large nodes, 
then the benefits derivable from the optimal 
schedule might very well justify a branch-and- 
bound approach, or perhaps even an exhaustive 
search. Since this is not the case under consid- 
eration, the hope for problems of this type lies 
in the development of simple heuristic algorithms 
which will produce optimal or near optimal results 
for the models in question. Where the models 
themselves defy analysis, it may be useful to 
develop heuristics which apply to simplified sub- 
sets. This approach seems to be the underlying 
motivation for work done on several similar models. 


T. C. Hu obtained results scheduling a tree 
of equal length nodes on a system of n identical
processors [5]. Fujii et al. [2,3] and Coffman
and Graham [1] have treated arbitrary acyclic 
graphs composed of equal length nodes and achieved 
optimal results for two processors of equal 
ability. 


Optimal results were also achieved with a 
simple algorithm for flow-shop jobs run on two
machines [7]. Extensions were made to this result 
which produced optimal schedules for two machines 
and all jobs with resource graphs of two nodes [6]. 


This research was supported in part by the 
Atomic Energy Commission. 
The possibilities are shown below:


Figure 2
Resource graphs treated by Jackson (Type 1, Type 2, Type 3).


Graham [4] points out the existence of, and 
bounds the anomalous behavior of acyclic graphs 
executed on n identical processors when demand 
scheduling is used. Simulation results presented 
by Manacher [8] show anomalous behavior to be a
common occurrence. This leads one
to suspect that the bounds achieved for such cases 
may be a valid indicator of expected behavior. 


Shen and Chen [9] have achieved bounds for a 
multiprocessor system with two classes of pro- 
cessors when the job set is a restricted class of 
flow-shop jobs. This suggests a valid starting 
point for this analysis: the unrestricted flow 
shop. 


2. Analysis of the Flow-Shop Model 


Definition of the Model 


Let S denote a system composed of m processors
of type A, and n processors of type B.
Let $F = \{f_1, f_2, \ldots, f_{r-1}, f_r\}$ be a set of flow-shop
jobs. Each job in F is represented by a two-tuple,
$(a_i, b_i)$, where $a_i$ = A-processor request
and $b_i$ = B-processor request, and $a_i$ must precede
$b_i$. That is to say, F is the set of jobs of
type 1.


Algorithm 


Johnson's optimal solution [7] is for the 
special case of the model for which m=n=1. His 
strategy consisted of ordering the jobs according 
to the following simple criterion: 


$f_i$ precedes $f_j$ if: $\min(a_i, b_j) < \min(a_j, b_i)$


A job set in which all its members have been 
sorted by the above criterion is said to follow a 
Johnson Order (JO). The optimality of the resulting
schedules was shown by proving that the ordering
minimizes the wait time on the B-processor.
Furthermore, the existence of ties, when the left
and right side of the relation are equal,
indicates the non-uniqueness of the optimal schedule.
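
The Johnson ordering for m=n=1 can be sketched in a few lines. The sketch below uses the common two-group form of the criterion (jobs with a <= b first by increasing a, the rest by decreasing b); the function names and the sample job list are illustrative assumptions and do not come from the paper.

```python
def johnson_order(jobs):
    """Order (a, b) jobs for one A-processor followed by one B-processor."""
    # Jobs with a <= b go first, by increasing a; the rest follow, by decreasing b.
    front = sorted((j for j in jobs if j[0] <= j[1]), key=lambda j: j[0])
    back = sorted((j for j in jobs if j[0] > j[1]), key=lambda j: -j[1])
    return front + back


def two_machine_makespan(order):
    """Completion time when the jobs run in the given order on both processors."""
    t_a = t_b = 0
    for a, b in order:
        t_a += a                   # the A-processor never idles
        t_b = max(t_b, t_a) + b    # a B-task waits for its A-task and the previous B-task
    return max(t_a, t_b)


jobs = [(3, 2), (1, 4), (2, 2)]    # hypothetical job set
order = johnson_order(jobs)
print(order, two_machine_makespan(order))
```

For this small job set the ordered list finishes at time 9, which is also the minimum over all six permutations.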


The ordering as it stands is unsuitable for 
use on multiprocessor systems since it fails to 
take account of the number of processors of each 
type. One would like to measure the impact of a 
node on the total resources of the system. This 
suggests a modification to the Johnson ordering
as follows:

$f_i$ precedes $f_j$ if: $\min(a_i/m, b_j/n) < \min(a_j/m, b_i/n)$;

in case of equality, largest f first.
This Modified Johnson Ordering (MJO) has a comfortable
intuitive feeling since one is obtaining the
optimal schedule on a Johnson machine of equivalent
power. Denote this system as S'. S' has a
single A'-processor with speed = m·speed(A), and a
single B'-processor with speed = n·speed(B). It is
also noteworthy that MJO is a generalization of
JO, in that they are identical for m=n=1.
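
A minimal sketch of the MJO sort follows, assuming the same two-group form of the Johnson criterion applied to the scaled requests a/m and b/n. The reading of "largest f first" as "larger a+b first" is an assumption made for illustration, as is the function name.

```python
from fractions import Fraction


def mjo_order(jobs, m, n):
    """Sort (a, b) jobs by the Modified Johnson Ordering for m A- and n B-processors."""
    def key(job):
        a, b = job
        a_scaled, b_scaled = Fraction(a, m), Fraction(b, n)
        size = a + b  # assumed meaning of "largest" when breaking ties
        if a_scaled <= b_scaled:
            return (0, a_scaled, -size)   # front group: increasing a/m
        return (1, -b_scaled, -size)      # back group: decreasing b/n
    return sorted(jobs, key=key)


print(mjo_order([(9, 0), (9, 0), (1, 5), (1, 5), (1, 10)], 2, 2))
```

With m=n=1 the scaling disappears and the key reduces to the plain Johnson ordering.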


One would hope that the demand schedule resulting
from the MJO is optimal on S as well.
The following example shows such a case.

Example 1: Let m=n=2 and F = {(9,0),(9,0),(1,5),(1,5),(1,10)}.
A demand schedule for the job list ordered as
given results in a worst case schedule, $T_\ell$.


Figure 3 


Worst case schedule for example 1.


MJO causes the jobs to be executed in the reverse 
order of their appearance in F. This produces the 
optimal schedule for this set, shown below 


Figure 4 
Optimal schedule for example 1. 


Optimal schedules, however, are not always pro- 
duced by MJO. The next example shows MJO generat- 
ing a schedule which is much larger than optimal. 


Example 2: Let m=n=2 and F={(1,5),(1,5),(2,10)}.
The jobs as listed are sorted by MJO. The demand 
schedule, which is actually a worst case, is 
shown below. 


Figure 5 
Worst case schedule for example 2. 
Reversing the job list results in a better
schedule: 


Figure 6 
Optimal schedule for example 2. 


These examples prompt two questions regarding MJO. 
How much of a saving can MJO produce? And, how 
badly can it perform? 
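
Examples 1 and 2 can be replayed with a small demand-schedule simulator. The sketch below is one reading of "demand schedule": whenever a processor is free it takes the first task in list order whose predecessor has finished. It also assumes that zero-length requests occupy no processor. The function names are illustrative.

```python
import heapq


def run_stage(durations, ready, processors):
    """List-schedule one processor class; returns each task's completion time."""
    free = [0] * processors                  # heap of times at which processors become free
    finish = [None] * len(durations)
    pending = list(range(len(durations)))    # kept in list (priority) order
    for i in list(pending):
        if durations[i] == 0:                # assumption: a null request needs no processor
            finish[i] = ready[i]
            pending.remove(i)
    while pending:
        t = heapq.heappop(free)
        ready_now = [i for i in pending if ready[i] <= t]
        # First ready task in list order; otherwise wait for the next release.
        i = ready_now[0] if ready_now else min(pending, key=lambda k: (ready[k], k))
        finish[i] = max(t, ready[i]) + durations[i]
        heapq.heappush(free, finish[i])
        pending.remove(i)
    return finish


def demand_makespan(job_list, m, n):
    """Completion time of the demand schedule for (a, b) jobs taken in list order."""
    a_done = run_stage([a for a, _ in job_list], [0] * len(job_list), m)
    b_done = run_stage([b for _, b in job_list], a_done, n)
    return max(a_done + b_done)


example1 = [(9, 0), (9, 0), (1, 5), (1, 5), (1, 10)]
example2 = [(1, 5), (1, 5), (2, 10)]
print(demand_makespan(example1, 2, 2), demand_makespan(example1[::-1], 2, 2))
print(demand_makespan(example2, 2, 2), demand_makespan(example2[::-1], 2, 2))
```

Under these assumptions the list of Example 1 finishes at 25 as given and at 11 when reversed (the MJO order), while the MJO-sorted list of Example 2 finishes at 16 and its reversal at 12.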


Bounds 


In deriving the MJO algorithm for S, we
made intuitive use of S', a Johnson machine of
equivalent power. Perhaps it would be instructive
to further compare these two systems in order
to determine the performance of MJO. The requests
for resource usage on S were represented
by a two-tuple, $(a_i, b_i)$. This same job, when
executed on S', requires resource usage of
$(a'_i, b'_i)$, where

$a'_i = a_i/m$, $b'_i = b_i/n$. (1)

Since S' has the equivalent power of S and is
of simpler structure, one would intuitively
expect that it could perform the same work load
as S.
Formalizing this expectation we have:
Lemma 2.1: Given a schedule for a job set F
on a system S whose completion time is $T_e$,
there exists a schedule on S' with a
completion time of $T'_e$ such that
$T'_e \le T_e$.
Proof: The proof is by construction of the
required schedule on S'.


As a first step, one considers a simulation
by S' of the schedule yielding $T_e$ on S. To
accomplish this we divide the schedule into unit
time slices. For each time slice on S from 1 to
$T_e$, let A' execute a unit portion of that task
executing on each of the m A-processors, while
B' executes a unit portion of that task executing
on each of the n B-processors. Since speed(A')
= m·speed(A) and speed(B') = n·speed(B), it is
clear that S' can keep up with the progress of
S. Let a time slice of $a_i$ or $b_i$ executed on S'
be denoted by $a_i^*$ and $b_i^*$ respectively, where


(2) 


A time slice in which no task is executing is 
said to be executing the idle task, denoted by x. 
Define

$Ta_i$ = time of the last occurrence of $a_i^*$

$Tb_i$ = time of the first occurrence of $b_i^*$


The second step is to rearrange this simulated
preemptive schedule into a non-preemptive
permutation schedule. The following procedure is
used:

Step 1. Sort the list of Ta's in ascending order.
        Do the same for the list of Tb's.
Step 2. Relabel the time slices as follows:
        $a_i^* \to a_j^*$ where j is the rank of $Ta_i$ in
        the sorted list of Ta's
        $b_i^* \to b_j^*$ where j is the rank of $Tb_i$ in
        the sorted list of Tb's
Step 3. Apply the following interchange rules
        until no further interchanges are possible,
        i.e. until all slices of each task
        are juxtaposed.

        Rule 1. If $a_i^*$ immediately precedes
                $a_j^*$ and i > j, or if x immediately
                precedes $a_i^*$, then interchange
                the two.
        Rule 2. If $b_i^*$ immediately precedes
                $b_j^*$ and i > j, or if $b_i^*$
                immediately precedes x, then
                interchange the two.

Rule 1 has the property that no $Ta_i$ is ever
increased, i.e. the completion time of no A task
is delayed by the use of Rule 1. Rule 2 has the
complementary property in which no $Tb_i$ is ever
decreased, meaning that all precedence constraints
between A- and B-tasks are preserved.


After all time slices for each task are juxtaposed,
the third step is to replace the time
slice notation with the job notation. The order
of appearance of these jobs on A' is a permutation
of F which will produce this schedule or
one better. The schedule thus constructed is a
permutation schedule since the order of completion
of A-tasks is the same as the order of the
initiation of B-tasks, which for S' is the same as
the order of initiation of A-tasks.


Since there could exist wait time between the
completion of an A task and the initiation of
its associated B task, the schedule produced
thus far is not necessarily a demand schedule.
Therefore, the final step is to convert that
schedule to a demand schedule by advancing the start of
all $b_i$'s until either the time their associated A
process completed or the completion of the
previous B-task, whichever is greater.


The following example will serve to clarify
the proof as well as to show the need for such a
mechanism to guarantee the inequality.


Example 3: Let m=n=3, and $F = \{f_1, f_2, f_3, f_4, f_5\}$ where
$f_1 = f_2 = (4,0)$, $f_3 = (1,3)$, $f_4 = (1,2)$, $f_5 = (1,1)$.


The schedule producing $T_e$ is shown below.


Figure 7
Optimal schedule for F on S.


The first step produced a simulated schedule on S'
as follows:


Figure 8
Simulation of $T_e$ by S'


The second step first sorts the lists of times: 


Index Ta's Tb's
1 Ta Tb
2 Tay Tb; 
3 Ta, Tb. 
4 Ta, 
5 Ta, 


then re-labels the time slices using rank in
the sorted lists.


(Diagram: relabeled time slices on A' and B'.)


Step three performs all possible interchanges,
resulting in


(Diagram: rearranged time slices on A' and B'.)


The ordering of F which produces $T'_e$ is now clear:
it is the order in which the tasks appear on A'.
The transformation by the last step to a demand
schedule results in a time $T'_e = 3.67$.


As pointed out in Example 2, MJO may not
always produce an optimal schedule on S. Since
it does produce the optimal schedule on S', it is
of interest to compare the performance of MJO on
S to that on S'. A multiprocessor system such
as S, since its power is based on parallel execution
of many jobs, cannot function effectively
when severely "underloaded". To obtain a meaningful
comparison we would therefore like to discard
cases where a single job dominates the schedule.
This may be accomplished by requiring the following
loading constraint (LC):


$a_i + b_i \le T'_h$, $1 \le i \le r$. (3)


Lemma 2.2: Given a job set F, subject to LC and
ordered by MJO. If $T_h$ and $T'_h$ are the completion
times of a demand schedule on S and S'
respectively, then

$\frac{T_h}{T'_h} \le 2 - \frac{1}{\max(m,n)}$.
Proof: The proof is by contradiction. We shall
assume that there exists a set of jobs which violates
the bound. We may further assume that the
number of jobs, r, is minimal. That is, $b_r$ is the
last B process to terminate on S when $a_r$ is
the last job started, i.e. the last job in MJO.
We can make the above assumption because if
$b_k$ is the last terminating process and k < r, we
can consider the truncated set $\{f_1, f_2, \ldots, f_k\}$. The
completion time on S for the truncated set, $P_h$,
is exactly $T_h$. On the other hand, the completion
time on S' of the truncated set, $P'_h$, is less than
or equal to $T'_h$ since there are fewer jobs in the
truncated set. Therefore,

$\frac{P_h}{P'_h} \ge \frac{T_h}{T'_h} > 2 - \frac{1}{\max(m,n)}$

and the truncated set forms a smaller counter-example
to the lemma. We shall therefore consider
r to be minimal.


The proof divides into two cases:


Case 1: The last task to complete is on a
B processor.
Case 2: The last task to complete is on an
A processor.

Case 1:


Let $Ta_i$ denote the time $a_i$ completes execution
and $Tb_i$ the time $b_i$ begins execution for all i.
The proof of this case shall be treated in three
subcases.


Case 1a: $Ta_r = Tb_r$


Demand scheduling requires there be no imbedded
idle time on the A or A' processors,
i.e. the schedule is compact to the left. The
latest time at which $a_r$ can begin is immediately
after the completion of all other tasks. Thus
for S,

$T_h \le \frac{1}{m}\sum_{i=1}^{r-1} a_i + a_r + b_r$. (4)
Correspondingly the time on S' is given by

$T'_h \ge \frac{1}{m}\sum_{i=1}^{r} a_i + \frac{b_r}{n}$. (5)

Dividing (4) by (5) yields

$\frac{T_h}{T'_h} \le 1 + \frac{\frac{m-1}{m} a_r + \frac{n-1}{n} b_r}{T'_h}$. (6)

Since the set forms a counter-example to the lemma,

$1 + \frac{\frac{m-1}{m} a_r + \frac{n-1}{n} b_r}{T'_h} > 2 - \frac{1}{\max(m,n)}$, (7)

which reduces to

$T'_h \, \frac{\max(m,n)-1}{\max(m,n)} < (a_r + b_r) \, \frac{\max(m,n)-1}{\max(m,n)}$. (8)

And this is a contradiction by the loading constraint (3).


Since consideration is restricted to demand type
schedules, the following cases, which treat
$Tb_r > Ta_r$, necessarily have all B processors
busy on the interval $[Ta_r, Tb_r)$, as shown in
Figure 9. We use $T_k$ to indicate the end of the
last idle period on the B processors prior to
$Tb_r$.


Figure 9


Case 1b: $Tb_r > Ta_r$ & $T_k = 0$

If there is no idle time on the B processors
except terminal idle then the latest start time
for the last B-process is immediately after all
other B processes have completed, or

$T_h \le \frac{1}{n}\sum_{i=1}^{r-1} b_i + b_r$. (9)

If there is no initial wait time on the B-processors
then the first m A processes must be
zero. And the time needed on S' is

$T'_h \ge \frac{1}{n}\sum_{i=1}^{r} b_i$. (10)

The desired ratio is

$\frac{T_h}{T'_h} \le 1 + \frac{\frac{n-1}{n} b_r}{T'_h}$, (11)

which leads to the same contradiction as (6).
Case 1c: $Tb_r > Ta_r$ & $T_k > 0$

From Figure 9, one sees that the schedule for F
which produces $T_h$ is first A-bound, and then
B-bound. All previous cases dealt exclusively with
one class of processors or the other. Therefore,
to facilitate treatment of this case one would
like to treat separately each group of jobs.
Define partitioned job sets as follows:


Job set 1, $F^1$, contains every job which completes
before $T_k$ in addition to the truncated portions of
those jobs in execution at that time, or

$F^1 = \{ f_i^1 = (a_i^1, b_i^1),\ 1 \le i \le r \}$ such that
$a_i^1 = \max(0, \min(a_i, T_k - Ta_i + a_i))$
$b_i^1 = \max(0, \min(b_i, T_k - Tb_i))$.


Job set 2, $F^2$, contains the remainder of all jobs
truncated to form $F^1$ as well as all jobs which
begin execution after $T_k$, or

$F^2 = \{ f_i^2 = (a_i^2, b_i^2),\ 1 \le i \le r \}$ such that
$a_i^2 = \max(0, \min(a_i, Ta_i - T_k))$
$b_i^2 = \max(0, \min(b_i, Tb_i + b_i - T_k))$.


Each job set has r jobs as before although
admittedly many are null. This manner of definition,
however, leaves the indices constant. For
the treatment of these partitioned job sets to
have relevance in bounding the total set, the
following relation must be established:

$T_h \le T_h^1 + T_h^2$, (12)

where $T_h^1$ and $T_h^2$ are the completion times of each
of the partitioned job sets, sorted by MJO, and
executed on S.

The composition of F may be divided into
the following subsets according to the makeup of
$F^2$. Define

X = set of job indices with $a_x^2 = 0$, $x \in X$
Y = set of job indices with $a_y^2 = a_y$, $y \in Y$
U = set of job indices with $0 < a_u^2 < a_u$, $u \in U$.

Referring to the ordering of the original set F,
it is clear that

$f_u$ precedes $f_y$ for all $u \in U$ and for all $y \in Y$. (13)

Since the set was subjected to MJO,

$\min(a_u/m, b_y/n) \le \min(a_y/m, b_u/n)$ for all $u \in U$
and for all $y \in Y$. (14)

Therefore

$f_x^2$ precedes $f_u^2$ precedes $f_y^2$
for $x \in X$, $u \in U$, & $y \in Y$. (15)
Since the validity of (14) cannot be altered
merely by the reduction of $a_u/m$ when all else
remains equal, all $f_x^2$, having zero A-requests,
precede all else. Therefore, the execution of
jobs in $F^2$ is identical to their execution in F
to within a permutation of processor numbers.
Therefore,

$T_h = T_k + T_h^2$. (16)
And 
ee a See 
Tea 2 ao <T (17) 
i=] 
Substituting (17) into (16) produces

$T_h \le T_h^1 + T_h^2$, (18)

and the validity of (12) is established.
Consider next the relationship between $T'_h$
and $T_h^{1'} + T_h^{2'}$, the sum of execution times on S'
of $F^1$ and $F^2$ ordered by MJO. From the definition
of this subcase and the description of the
partitioning, the execution time of $T_h^{1'}$ is governed by
the A-tasks. Thus when $T_h^{1'} + T_h^{2'}$ is considered,
no additional time is required for execution of
the A-tasks. The only source of increased
execution time for the two subsets arises when the
execution of all or part of a B task is "held
up" due to the construction of the partitioned
sets. This can happen in two ways.

First, the "release" of $b_i$ for consideration
by B' in the partitioned set is governed by the
schedule of the A-tasks on S. If $a_i$ is an
initial task on S, the release time of $b_i$ is
exactly $a_i$. If $a_i$ is also chosen to be an initial
task on S', then the new release time is $a_i/m$.
Thus it is possible to delay the release of $b_i$ by
a maximum of $a_i \left(\frac{m-1}{m}\right)$ due to the partitioning of
the job set. If $a_i$ is not an initial task on S,
the earliest possible time to schedule $a_i$ on S'
can not be sooner than the schedule time of $a_i$ on
S. Therefore for tasks whose ordering is not
changed by the partitioning of the job set, the
release time of the B-tasks on B' can be delayed
up to $a_\ell \left(\frac{m-1}{m}\right)$, where $a_\ell$ is the largest
A-task.
The second source of delay is an A-task
whose ordering is altered by the partitioning
itself. There are at most m A-tasks which were
altered by the partitioning. Designate this set as
U and let $a_u$ be any member of the set. When the
job set is executed as a whole, the time at which
$b_u$ is released to B' is given by:

$R_u^W = \sum_{i=1}^{u} \frac{a_i}{m}$. (19)

When $F^2$ is executed, the members of U may be
reordered. The release time for $b_u$ in the sequential
execution of the partitioned sets is given by

$R_u^P \le \sum_{i=1}^{u-1} \frac{a_i}{m} + \sum_{i \in U} \frac{a_i}{m}$ (20)

$\le R_u^W + \sum_{i \in U,\ i \ne u} \frac{a_i}{m}$. (21)

Let $a_\ell$ denote the largest member of U. Then (21)
becomes

$R_u^P \le R_u^W + \frac{m-1}{m} a_\ell$. (22)

Since on S' the B-tasks are executed sequentially,
the delay experienced by the last B-task
is bounded above by the largest delay previous
to it. Thus we have established the desired
relation between the execution time of the whole
set to that of the two partitioned sets,

$T_h^{1'} + T_h^{2'} \le T'_h + \frac{m-1}{m} a_\ell$. (23)

From the construction of the partition we have

$T_h^{1'} = T_k$. (24)

Substituting this into (16),

$T_h = T_h^{1'} + T_h^2$ (25)

$\le T_h^{1'} + \frac{1}{n}\sum_{i=1}^{r} b_i^2 + \frac{n-1}{n} b_r$. (26)

And from S' we have

$T_h^{2'} \ge \frac{1}{n}\sum_{i=1}^{r} b_i^2$. (27)

Substituting this into (26) leaves

$T_h \le T_h^{1'} + T_h^{2'} + \frac{n-1}{n} b_r$, (28)

which using (23) reduces to

$T_h \le T'_h + \frac{m-1}{m} a_\ell + \frac{n-1}{n} b_r$. (29)

The assumption of a minimal set requires $\ell \le r$ in
the MJO, for any A task which is all or in part
contained in $F^1$. This implies

$\min(a_\ell/m, b_r/n) \le \min(a_r/m, b_\ell/n)$. (30)

This requires either $b_r \le b_\ell$ or $a_\ell \le a_r$;
making the appropriate substitution leaves

$T_h \le T'_h + \left(\frac{m-1}{m}\right) a_\ell + \left(\frac{n-1}{n}\right) b_\ell$ (31a)

or

$T_h \le T'_h + \left(\frac{m-1}{m}\right) a_r + \left(\frac{n-1}{n}\right) b_r$. (31b)

Dividing (31a) and (31b) by $T'_h$ leaves equations of
the same form as (6), which leads to the same
contradiction of the loading constraint.

Case 2: An A-process is last to terminate. With
the exception that $b_r = 0$, this case produces the


same equation as (4) and of course leads to the 
same contradiction. 


All cases and subcases treated are seen to 
lead to a contradiction and thus the lemma is 
proved. 


The two previous lemmas may be used to prove 
the following theorem which provides a performance 
bound for MJO. 


Theorem 2.1: For a job set F,

$\frac{T_h}{T_e} \le 2 - \frac{1}{\max(m,n)}$

where $T_e$ = completion time of the shortest
schedule possible for F on S,
$T_h$ = completion time of the MJO schedule.

Proof: Lemma 2.2 provides

$\frac{T_h}{T'_h} \le 2 - \frac{1}{\max(m,n)}$.

By Johnson's result $T'_h$ is optimal, thus

$T'_h \le T'_e$.

By Lemma 2.1, $T'_e \le T_e$.
Combining yields

$\frac{T_h}{T_e} \le 2 - \frac{1}{\max(m,n)}$.

Q.E.D.
Example 4: Let F = { n(n-1) jobs of the form (ε,1),
1 job of the form (2ε,n) }
in a system with m=n.


The MJO schedule is shown in Figure 10:


Figure 10
The optimal schedule is shown in Figure 11:


Figure 11


From Figures 10 and 11, we have

$\frac{T_h}{T_e} = \frac{2n-1+\epsilon}{n+2\epsilon} = 2 - \frac{1+3\epsilon}{n+2\epsilon}$

$\lim_{\epsilon \to 0} \frac{T_h}{T_e} = 2 - \frac{1}{n}$
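
The arithmetic behind Example 4 can be written out step by step. The sketch below assumes ε is small compared with 1 and that the MJO list places the n(n-1) jobs (ε,1) ahead of the job (2ε,n), since ε/n < 2ε/n. The intermediate terms are reconstructed from the job set and are not read from Figures 10 and 11.

```latex
\begin{align*}
T_h &= \underbrace{\epsilon}_{\text{first A-tasks}}
     + \underbrace{(n-1)}_{n(n-1)\ \text{unit B-tasks on } n \text{ B-processors}}
     + \underbrace{n}_{\text{last B-task}}
     = 2n - 1 + \epsilon, \\
T_e &= \underbrace{2\epsilon}_{\text{A-task of the large job}}
     + \underbrace{n}_{\text{its B-task}}
     = n + 2\epsilon, \\
\frac{T_h}{T_e} &= \frac{2(n + 2\epsilon) - (1 + 3\epsilon)}{n + 2\epsilon}
     = 2 - \frac{1 + 3\epsilon}{n + 2\epsilon}
     \;\longrightarrow\; 2 - \frac{1}{n} \quad (\epsilon \to 0).
\end{align*}
```

In the optimal schedule the large job is started first, and the remaining n(n-1) unit B-tasks fit on the other n-1 B-processors within the same span.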

Example 4 shows that the bound of Theorem 2.1 
is approachable. If we remove the "largest first 
rule" to break ties in MJO, which is not used in 
the proof of Theorem 2.1, the bound may be reached 


by scheduling n(n-1) jobs of the form (0,1) and 
one job of the form (0,n). 


The following theorem gives the worst case 
bound for the flow-shop model with two classes of 
processors. 


Theorem 2.2: For a given job set F, where $T_e$ is
the earliest completion time possible
for a demand schedule on S, the
latest completion time, $T_\ell$, is given
by

$\frac{T_\ell}{T_e} \le 3 - \frac{1}{\max(m,n)}$.


Proof:


Consider a schedule which produces $T_\ell$.


Figure 12


The following notations are used in Figure 12:


$b_r$ = last task to complete
$a_r$ = A-task associated with $b_r$
$Ta_r$ = time $a_r$ completes execution
$Tb_r$ = time $b_r$ begins execution
$T_k$ = time of the last idle time on any
B-processor before $Tb_r$.
The following are lower bounds on the earliest
completion time possible:

$T_e \ge \frac{1}{m}\sum_{i=1}^{r} a_i$, time needed to execute all
A-tasks, (32a)

$T_e \ge \frac{1}{n}\sum_{i=1}^{r} b_i$, time needed to execute all
B-tasks, (32b)

$T_e \ge a_i + b_i$, $1 \le i \le r$, time of the largest
job. (32c)

The proof is divided into three cases.


Case 1: $Ta_r = Tb_r$, i.e. the last B-task to complete
begins execution immediately after the
completion of its associated A task.


By the constraint of demand scheduling

$T_\ell \le \frac{1}{m}\sum_{i=1}^{r} a_i + \frac{m-1}{m} a_r + b_r$, (33)

which reduces immediately by (32a) and (32c) to

$T_\ell \le 2 T_e$ or $\frac{T_\ell}{T_e} \le 2$. (34)


Case 2: $Ta_r \ne Tb_r$ & $T_k = 0$, i.e. there is no
imbedded wait time on a B-processor. The time is
then given by

$T_\ell \le \frac{1}{n}\sum_{i=1}^{r} b_i + \frac{n-1}{n} b_r$, (35)

which also reduces with the application of (32a)
and (32c) to

$T_\ell \le 2 T_e - \frac{1}{n} T_e$ or $\frac{T_\ell}{T_e} \le 2 - \frac{1}{n}$. (36)
Case 3: $Ta_r \ne Tb_r$ & $0 < T_k < Tb_r$
By definition

$T_\ell = Tb_r + b_r$. (37)

And the time that the last B-task starts can be
bounded by

$Tb_r \le T_k + \frac{1}{n}\sum_{i=1}^{r-1} b_i$. (38)

And the time of the last idle time on a B-processor
is bounded by

$T_k \le \frac{1}{m}\sum_{i=1}^{r-1} a_i + a_r$. (39)

Substituting (38) and (39) into (37) gives

$T_\ell \le \frac{1}{m}\sum_{i=1}^{r-1} a_i + a_r + \frac{1}{n}\sum_{i=1}^{r-1} b_i + b_r$. (40)

Re-arranging,

$T_\ell \le \frac{1}{m}\sum_{i=1}^{r} a_i + \frac{1}{n}\sum_{i=1}^{r} b_i + (a_r + b_r)\,\frac{q-1}{q}$, (41)

where q = max(m,n).
Applying (32a), (32b), and (32c) leaves

$T_\ell \le T_e + T_e + \frac{q-1}{q} T_e$ or $\frac{T_\ell}{T_e} \le 3 - \frac{1}{q}$, (42)

which is the largest of the case bounds.
Q.E.D.


The following example shows that the bound is
approachable for a large set of jobs.


Example 5:
Let F = { m jobs of the form (n,0),
n(m-1) jobs of the form (ε,1),
1 job of the form (ε,n) }.


The demand schedule with the longest completion
time, $T_\ell$, looks like


Figure 13


It follows that $T_\ell = 3n - 1 + \epsilon(n/m)$.


The demand schedule with the shortest completion
time, $T_e$, is


Figure 14

And $T_e = n + \frac{n(m-1)}{m}\epsilon + \epsilon$, so that

$\frac{T_\ell}{T_e} = \frac{3n - 1 + \epsilon(n/m)}{n + \frac{n(m-1)}{m}\epsilon + \epsilon}$

$\lim_{\epsilon \to 0} \frac{T_\ell}{T_e} = 3 - \frac{1}{n}$

Table 1 places this flow-shop result in perspective
with known results of simple heuristic
algorithms for similar models.

SYSTEM                    JOB SET       ALGORITHM          $T_\ell/T_e$     $T_h/T_e$        $T_h/T_\ell$
2 processors of           Flow-shop     Johnson Ordering   2                1                1/2
different types
n identical               independent   largest first      2-1/n            (4/3)-(1/3n)     (4n-1)/(6n-3)
processors                tasks
m A- and n B-processors   Flow-shop     Modified Johnson   3-1/max(m,n)     2-1/max(m,n)     (2n-1)/(3n-1)

Table 1: Comparison of several known results on similar models


3. Extensions of the Model


In the previous section we dealt exclusively
with jobs of type 1. Define $G = \{g_1, g_2, \ldots, g_{r-1}, g_r\}$
to be a set of flow-shop jobs of type 2, i.e. the
first of two nodes must be executed on a B processor.
The MJO criterion can be restated for
jobs of type 2:

$g_i$ precedes $g_j$ if: $\min(a_j/m, b_i/n) < \min(a_i/m, b_j/n)$;

in case of equality, largest g first.


From symmetry considerations the bounds for type 1
jobs also apply to type 2 jobs. Jobs of type 3
may be considered as two jobs, a type 1 and a
type 2.


It would be of interest to see if MJO thus 
extended is of value in scheduling jobs with more 
general resource graphs. To be applicable, the 
more complex structures must be mapped into the 
two nodes of the model. This is done by taking 
the first available node of the resource graph as 
the first of two in the model. The second node 
of the model is constructed from the scaled sum 
of all nodes remaining. It is assumed to belong 
to the processor opposite to that selected as the 
first node, in accordance with the constraints of 
the model. Nodes belonging to the opposite kind 
of processor are scaled by m/n or n/m, whichever 
is necessary to convert all nodes to the same 
dimension. 


The pseudo jobs thus created conform to all 
the constraints of the model. These pseudo jobs 
are then sorted according to MJO. Assignments to 
processors are then made on a demand basis. When 
the first node of any pseudo job, the real one, 
completes, a new pseudo job is computed for all
nodes whose execution was precluded by the node 
now complete. Pseudo jobs are created in this 
manner until only two real nodes remain for a job, 
at which time there is no further need of pseudo 
jobs. 
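
The mapping of a general resource graph onto a two-node pseudo job can be sketched as follows. The text does not spell out the direction of the scaling, so the sketch assumes that a node is converted so that its size divided by the number of processors of its class stays the same; the function name and the tuple layout are illustrative assumptions.

```python
from fractions import Fraction


def pseudo_job(first, remaining, m, n):
    """Collapse a resource graph into a two-node pseudo job.

    first:     (kind, size) of the first available node, kind is "A" or "B"
    remaining: list of (kind, size) for every other node of the job
    Returns (kind of first node, first size, second size).
    """
    first_kind, first_size = first
    second_kind = "B" if first_kind == "A" else "A"
    count = {"A": m, "B": n}
    second = Fraction(0)
    for kind, size in remaining:
        # Assumed scaling: keep size / (number of processors of that kind) unchanged.
        second += Fraction(size) * Fraction(count[second_kind], count[kind])
    return first_kind, first_size, second


# A-node of size 2 first, followed by an A-node of size 4 and a B-node of size 3.
print(pseudo_job(("A", 2), [("A", 4), ("B", 3)], m=2, n=4))
```

A pseudo job whose first node is an A-node would then be sorted as a type 1 job, and one whose first node is a B-node as a type 2 job.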


A simulation test of this extension was con- 
ducted. The structure of the resource graphs was 
limited to two parallel paths, one A-request
and one B-request. A given node could depend on 
the previous A-node, B-node, or both. 10,000 
jobs were constructed in this fashion with the 
number of nodes per job being a uniformly distrib- 
uted random number between n+m and 2mn. The pre- 
cedence relations of each node were also determined 
by a random variable. The size of the nodes was 
also uniformly distributed between zero and the 
respective number of processors. 
The job set was executed with the MJO ex-
tension described above, and then again using the 
task list as generated. The MJO extension was 
found to provide job times that were on the 
average 4-8% smaller than produced by the random 
ordering. It is interesting to note that Manacher 


[8] quotes 5-15% as the typical savings of a 
heuristic over random in scheduling tasks on a 
system of n identical processors. 
SCHEDULING IN A MULTIPROCESSOR ENVIRONMENT (a)


J.M. Gwynn and R.J. Raynor 
School of Information and Computer Science 
Georgia Institute of Technology 


Summary 


In a multiprocessor system, the handling of 
interrupts generated by jobs in the processors is 
assigned to a supervisory program and associated 
data base. Techniques for deciding which processor
executes the supervisor include master-slave,
floating executive control, and others [1].
Regardless of the technique employed, queueing of
requests to the supervisor may occur. In a
master-slave system, the master processor can 
handle only one request at a time. In a floating 
executive system, only one processor can access 
the supervisor's data base at a time[2]. 


Madnick has developed a finite-source 
queueing model which explicitly relates the number
of processors in the system to the average
number of processors idle due to clustering of 
requests to the supervisor. As an indication of 
the severity of the problem, his model predicts 
that a system with 21 processors will have an 
average of 2.8 processors idle due to supervisor 
clustering[2]. Due to the nature of his model, 
however, this may be a pessimistic estimate. 


A resolution to this problem can exist only 
if the supervisor is not saturated, i.e., if the
total expected execution time of the supervisor 
during a given period is not greater than the 
length of that period. Stated another way, the 
supervisor will not be saturated if the system is 
designed such that the supervisor is not a 
limiting resource. Assuming an unsaturated sys- 
tem, the natural solution to the problem would 
seem to be to schedule jobs to the processors in 
such a way that they would cause an interrupt at 
a time when the supervisor was idle[3]. The 
assumption implicit in this solution is that, for 
each job in the system, the time until the next 
interrupt must be predictable from the job's 
history of execution. While prediction of this 
information has not been implemented in many
situations, Pass has used a single exponential
smoothing formula and corrector which dynamically
modifies the smoothing constant at each
interrupt with promising success [4]. It will
therefore be assumed that this information, as well as
the length of time the supervisor requires to 
handle an interrupt, can be predicted with some 
degree of accuracy. 


(a) This research was supported in part by NSF


The algorithm to implement this solution
would be a simple two table search. The first
table would have an entry for each ready job in
the mix specifying the time until the next
interrupt and the supervisor time required to handle
that type of interrupt. The other table would be
a schedule of supervisor idle periods. For each
job in the mix, a decision would be made as to
whether the supervisor had an idle period
corresponding to the period from current time plus
process time to current time plus process time
plus supervisor time. A match would cause that
job to be scheduled. The order in which the
first table is searched may be determined by
priority or some other external criteria.
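
The two table search can be sketched as follows. The table layouts (a list of (time until interrupt, supervisor time) pairs in search order, and a list of (start, end) idle periods) and the function name are assumptions made for illustration; the summary does not give a concrete representation.

```python
def schedule_next(ready_jobs, idle_periods, now=0.0):
    """Return the index of the first ready job whose predicted interrupt fits
    inside a supervisor idle period, reserving that time; None if no job fits.

    ready_jobs:   list of (time_until_interrupt, supervisor_time), in search order
    idle_periods: list of (start, end) supervisor idle periods, modified in place
    """
    for index, (run_time, supervisor_time) in enumerate(ready_jobs):
        start = now + run_time              # predicted time of the interrupt
        end = start + supervisor_time       # supervisor busy until here
        for k, (lo, hi) in enumerate(idle_periods):
            if lo <= start and end <= hi:
                # Reserve the slot by splitting the idle period around it.
                pieces = [(lo, start), (end, hi)]
                idle_periods[k:k + 1] = [(a, b) for a, b in pieces if b > a]
                return index
    return None


idle = [(0, 2), (5, 10)]
jobs = [(3, 1), (6, 2), (1, 0.5)]
print(schedule_next(jobs, idle), idle)
```

Each call scans every ready job against every idle period, which is the comparison count that motivates the cheaper block-based variant.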


While the algorithm just described is 
simple, the amount of computation involved would 
perhaps be prohibitive. For this reason a sub- 
optimal algorithm was developed which requires 
much less computation at the price of a small 
decrease in effectiveness. This algorithm is 
based on the original but with a discretization 
of time into blocks of time. Based on the num- 
ber of comparisons in the search, the sub-optimal 
algorithm is approximately 2(P+1)/F times faster 
than the optimal one; where P is the number of 
processors and F is the ratio of average super- 
visor time to block size. A more important point 
is that the sub-optimal algorithm would allow a 
hardware implementation, using only a few special 
registers, which would reduce the search to a 
few logical operations. 
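
One way to picture the block-based variant is a bit mask with one bit per time block, so that the search becomes a shift and an AND. This is only a sketch under the assumption F=1 (one interrupt occupies one block); the mask layout and the function name are hypothetical and are not taken from the summary.

```python
def try_schedule(busy_mask, time_to_interrupt, block_size):
    """Return (ok, new_mask) for a job whose interrupt is expected after time_to_interrupt.

    busy_mask: integer whose bit k is 1 when the supervisor is committed in block k
    """
    block = int(time_to_interrupt // block_size)   # block in which the interrupt falls
    bit = 1 << block
    if busy_mask & bit:                            # supervisor already committed there
        return False, busy_mask
    return True, busy_mask | bit                   # reserve the block for this job


mask = 0b0100                                      # block 2 already reserved
print(try_schedule(mask, 2.5, 1.0))
print(try_schedule(mask, 3.2, 1.0))
```

The first call is refused because block 2 is taken; the second reserves block 3.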


For F=1, a case in which the hardware imple- 
mentation would be especially feasible, a GPSS 
simulation model has predicted that for 21 proc- 
essors there would be a reduction in average 
number of idle processors to 0.7, with a corre- 
sponding increase in thruput of 12%. While this 
is 75% of optimal improvement, it is expected 
that this could be improved, possibly to 90%, 
thru fine tuning of the algorithm parameters. 


Since the mix size is assumed to be large 
enough to find a job that will interrupt during a 
supervisor idle period, it is likely that some 
jobs may be delayed an excessive amount of time. 
The standard procedure for dealing with this 
problem is dynamic priority assignment. Current 
investigations are underway to determine the 
effect of this and other such modifications on 
the performance improvement gained thru the use 
of the algorithm developed here. 
RADCAP: AN OPERATIONAL
PARALLEL PROCESSING FACILITY 


James D. Feldman 
Goodyear Aerospace Corporation 
Akron, Ohio 44315 


Oskar A. Reimann 
Rome Air Development Center 
Rome, N.Y. 


Summary: An overview is presented of RADCAP,
the operational associative array processor (AP)
facility installed at Rome Air Development Center
(RADC). Basically, this facility consists of a
Goodyear Aerospace STARAN (a) associative array
(parallel) processor and various peripheral devices,
all interfaced with a Honeywell Information Systems
(HIS) 645 sequential computer, which runs under
the Multics time-shared operating system. The
RADCAP hardware and software are described
briefly here because they are detailed in companion
papers presented at this conference (1) (2). The
latter part of this paper dwells on the objectives of
the RADCAP facility and plans for its use.


(a) T.M., Goodyear Aerospace Corporation, Akron, Ohio,
RADCAP Facility


Figure 1 shows a block diagram of the hard- 
ware within the RADCAP facility. The 645, which 
has been in existence at RADC for several years, 
is a very large computer system with a multitude 
of peripherals typical of large time-shared systems. 
In March 1973, hardware was delivered to RADC 
in the form of a STARAN parallel processor with 
four arrays, a custom input/output unit (CIOU), a 
hardware performance monitor, and a variety of 
peripherals. Subsequently, the CIOU was used to 
interface STARAN with a 645 I/O channel. At the
same time, STARAN software was interfaced with 
the 645 Multics time-shared operating system. 
At present, the RADCAP facility is totally
operational and includes system software to allow 
for operation in both a STARAN stand-alone mode 
and an integrated STARAN/Multics mode. 


STARAN Parallel Processor 


STARAN can perform search, arithmetic, 
and logical operations simultaneously on either all 
or selected words of its memory. Figure 2 shows 
the basic STARAN elements. The most important 
is the associative array and its unique multi-dimen- 
sional access capability which, along with the other 
elements, are described in more detail in refer- 
enced publications (1) (3) (4). Listed below are 
brief descriptions of the STARAN elements: 


1. Associative array: provides multi-dimensional
access, content-addressable memory with
65,536 (2^16) bits of storage and 256 processing
elements; permits parallel arithmetic, search, and
logical operations.


2. AP control: performs data manipulation
within associative arrays as directed by the program
stored in AP control memory.


3. AP control memory: stores AP control
instructions. Can also store data and act as buffer
between AP control and other system elements.


4. Sequential controller and memory: performs
maintenance and test functions, controls
peripherals, maintains job control, provides means
for operator communication between various
STARAN elements, and assembles STARAN programs
written in MAPPLE (Macro-Associative
Processor Programming LanguagE).


5. External function: transfers control infor- 
mation among STARAN elements. 


STARAN has been designed to provide a flexible
I/O capability. The standard peripherals for
STARAN are listed below, along with a typical list 
of optional peripherals: 


1. Standard: cartridge disk drive and control, 
paper tape reader, paper tape punch, and keyboard 
printer. 


2. Optional: line printer, card reader, mag- 
netic tape, keyboard crt, and other peripherals, as 
desired, that are compatible with the Digital Equip- 
ment Corporation (DEC) PDP-11. 


All these peripherals interface with the
STARAN system's sequential controller, a PDP-11
mini-computer. STARAN also provides facilities
for interfacing with other processors. The four
buses provided (see STARAN block diagram,
Figure 2) are the direct memory access, the buffered
I/O, the external function, and the parallel I/O.


Figure 2. STARAN Block Diagram
(Elements shown: AP control memory, memory port logic, AP control,
sequential controller, external function logic, associative array 0
(256 x 256), and optional associative arrays (256 x 256 each, up to
32 total); buses: direct memory access (DMA), buffered input/output
(BIO), external function (EXF), and parallel input/output (PIO).)
The direct memory access is a 32-bit bus for
STARAN to address external memory. The AP 
control or the sequential controller can access 
external memory at a rate dependent upon this 
memory's cycle time. 


The buffered I/O is a 32-bit bus for processors 
to address STARAN. Depending upon which portion 
of control memory is accessed, the access rate is 
0.4 to 1.0 microsec per 32-bit word. 


The external function is a bus for exchange of 
control signals. Discrete signals and interrupts 
can be both generated and accepted across this bus. 


The parallel I/O is a bus for STARAN array 
I/O. Up to 256 bits per array (e.g., one bit per 
array word) can be provided. If all 32 arrays are 
implemented, up to 8192 bits can be utilized in 
parallel at a transfer rate less than one micro- 
second, dependent upon the desired application. 


STARAN Performance Summary 


In a high-speed, asynchronous, pipe-line type 
processor such as STARAN, it is difficult to sum- 
marize performance since speeds vary with instruc- 
tion types, types of loops, etc. Also, the overall 
effective speed depends upon the number of words 
in the arrays over which the simultaneous opera- 
tions are occurring. However, an effort is made 
below to list the performance and features of a 256 
x 256 associative array, the control unit, and the 
interface portion of STARAN: 


Associative Array Features 


Up to 32 Arrays per system 
Multi-dimensional access (bit slice or word slice) 
Array module speed: 
Typical search: 150 nsec/bit 
Typical add or subtract: 800 nsec/bit 
Read bit or word slice (256 bits): 150 nsec 
Write bit or word slice (256 bits): 300 nsec 


Control Unit Features


Two separate processors: AP control, sequential controller
Solid-state control memory capacity: 2K x 32 standard, 4K x 32 maximum
Solid-state control memory speed: 150 nsec/instruction (typical)
Bulk core capability: 16K x 32 standard, 32K x 32 maximum
Bulk core speed: 1 microsec (read or write)


Interface Capabilities


STARAN to address external memory: rate is memory dependent
External processor to address STARAN: 0.4 to 1.0 microsec/32-bit word
Parallel I/O to/from associative arrays: less than 1.0 microsec/8192 bits (maximum)


Control signals and interrupts 


Custom Input/Output Unit (CIOU) 


Figure 3 shows a simplified block diagram of 
the STARAN/RADCAP custom input/output unit
(CIOU). As indicated, the CIOU contains a parallel
input/output (PIO) module, a 645 computer interface,
and an internal performance monitor. The CIOU 
functions as a mini-processor much the same as the 
control unit portion of STARAN. Processing within 
one array module (e.g., under STARAN control) 
may be concurrent with I/O in another array module 
(e.g., under PIO control). 


Figure 3. Simplified Block Diagram of Custom I/O Unit
(Elements shown: parallel I/O module with memory and control,
645 interface logic, and internal performance monitor.)


As directed by instructions stored in PIO con- 
trol memory, the optional PIO module manipulates 
data among and within the associative arrays con- 
current with operations as directed by AP control. 
The PIO module contains eight ports, with 256 bits 
per port to accommodate associative array I/O and 
to permute data. 


The 645 interface logic provides a communica- 
tion path between the 645 computer and the STARAN 
system. This interface logic contains a 30-charac- 
ter queue and a 32-bit status register which are tied 
to a 645 I/O channel. The status register contains
interface control signals, and the queue buffers data
being transferred to or from the 645.


The internal performance monitor, although 
contained in the CIOU, is best discussed in the 
following description of the hardware performance 
monitor. 
Hardware Performance Monitor


To help meet a RADCAP facility objective of 
measuring system performance, a hardware perfor- 
mance monitoring capability has been provided by an 
internal performance monitor in the CIOU cabinet 
and an external performance monitor system. Meas- 
urements can be made to determine instruction ex- 
ecution timing, control memory and bus utilization, 
array utilization, and activity in the pager, the PIO 
module, and the 645 interface. 


The internal performance monitor is used ex- 
clusively for STARAN instruction execution times 
and instruction event times. The events counted 
and timed are the execution of flagged instructions 
in AP control. Between a start flag and an end flag, 
a timer increments at a 100-nsec rate. Overflows 
from this counter interrupt the sequential controller. 
In addition, the sequential controller can interrogate
the event counter and timer. 


The external performance monitor is a self- 
contained system that can monitor any point of 
STARAN or the custom I/O. Data are acquired 
via probes that detect logical signal changes in either 
an event count or elapsed time mode. Several probes
can be logically connected via a patchboard to trigger
a counter. At regular intervals, the contents of the 
counters are written as a record on a magnetic tape 
unit. The performance monitor software then eval- 
uates the collected data and produces the results in 
the form of reports and graphs. The software for 
the performance monitor runs on the 645. 


Physical Description of Hardware 


All the elements shown in the STARAN block 
diagram (Figure 2), including the associative 
arrays, are built using dual-in-line IC's (integrated 
circuits) mounted on multi-layer printed circuit 
boards. Thus, the physical construction of 
STARAN and the CIOU is similar to that of typical 
high-speed sequential processors.


Figure 4 shows Goodyear Aerospace's STARAN 
demonstration and evaluation facility. Table 1 gives 
the approximate numbers of cabinets, boards, and 
IC's for the various STARAN models. These figures
do not account for I/O logic, since this varies
from one installation to another. The STARAN/
RADCAP CIOU, which includes the parallel I/O
option for all four arrays, contains approximately
200 boards and 8,000 IC's.


Figure 4. STARAN Demonstration and Evaluation Facility
Table 1. Approximate STARAN Component Count (a)


STARAN    No. of    No. of      No. of Printed    No. of Integrated
Model     Arrays    Cabinets    Circuit Boards    Circuits
S-250        1         3                                9,000
S-500        2         3                               11,500
S-750        3         3                               14,100
S-1000       4         4                               16,700
S-1250       5         4                               19,300
S-1500       6         4                               21,900
S-1750       7         5                               24,900
S-2000       8         5                               27,500
S-4000      16         8                               48,700


(a) Without input/output


Although up to three arrays can be packaged in
one cabinet, the RADCAP configuration has two
arrays per cabinet for symmetry. Figure 5 shows
the equipment that was delivered to RADC. This
includes a sequential control cabinet, an AP control
cabinet, two AP memory cabinets for the four
associative arrays, and a CIOU cabinet. The disk
drive and line printer are mounted in separate
cabinets. The keyboard/printer, the card reader,
and the graphics display console can be mounted on
table tops or pedestals. As mentioned earlier, the
internal performance monitor is packaged within
the CIOU cabinet. The external performance monitor,
not shown in Figure 5, mounts on a table top.


Figure 5. STARAN Complex at RADC


Summary of System Software


The system software available for STARAN/
RADCAP is capable of operating STARAN in a
stand-alone mode or, when integrated with the 645,
in a STARAN/Multics configuration. The system
software is based upon a disk operating system,
which provides ready access to system programs,
device independent I/O, and a file system. Operation
of STARAN can be under direct control of the
user at the control console or run in a batch mode
with a control stream from an input device like the
card reader.
The total assembly package for STARAN has a
macro language processor, an APPLE assembler, 
and a relocating linker. Programs are written in 
the APPLE and MAPPLE languages. Extensive 
string handling and substitution are implemented in 
the macro-preprocessor. APPLE is a symbolic 
language that includes mnemonics for parallel and 
associative operations. The linker combines 
separately assembled object modules by relocating 
code as necessary and resolving globally defined 
symbols. 


Control of processing in STARAN is through 
interactive system routines. These routines are 
the interface between application program execu- 
tion and the user. They allow the user to start and 
halt STARAN, to load programs and overlays, and 
to debug programs with trace, memory modification, 
and dump commands. 


Diagnostic programs for STARAN hardware are 
disk resident. The programs can be called individ- 
ually, in groups related to specific parts of the hard- 
ware, or as a total set for complete system testing. 
Fault detection and location are provided. 


Additional software for the integrated STARAN/ 
Multics operation is designed to handle the interface 
between the computers and the use of STARAN from 
Multics. For the interface, a special device driver 
module has been added to the STARAN disk operating 
system. This driver is similar to drivers used for 
peripherals. It has been specialized for Multics 
and can accommodate 16 open files simultaneously. 
A device interface module (DIM) has been added to 
Multics as the counterpart to the device driver. 
These two modules are basic parts of each machine's 
operating system and are transparent to the pro- 
grammer. 


STARAN can be operated from Multics by
commands a user inputs at a terminal or from a
file. File control procedures handle STARAN-related
keyboard inputs and provide the interface
between the DIM and the Multics storage system.
With these procedures, a user process executing 
in the 645 can call for execution of a STARAN 
program. 


To facilitate the assembly of STARAN programs, 
a cross assembler is provided for time-shared use 
in Multics. This assembler accepts MAPPLE and 
APPLE as inputs. 


Objectives and Uses 


The basic objective of the RADCAP facility is 
to explore the performance of a hybrid computer 
configuration (STARAN associative processor in- 
terfaced with a 645 sequential processor) on real- 
world, real-time problems. A specific goal is to 
determine the cost-effectiveness of associative/ 
parallel processing in such an environment.
Associative processing has been studied extensively in
both theoretical and simulation studies, but no
significant practical operating experience with
associative processors exists. Experimentation is
necessary to provide "hard" data and fill in the presently existing
void. Practical operating experience also is re- 
quired so that a general-purpose associative proc- 
essor configuration could be developed if results 
warrant it. 


The RADCAP facility will be used in an experi- 
mental program to evaluate the internal performance 
of this hybrid computer configuration by means of 
hardware and/or software performance monitors 
to determine internal component utilization and 
system bottlenecks. Programming aspects of asso- 
ciative processing also will be investigated. Asso- 
ciative-processing programming is not well under- 
stood and represents radical departures from the 
traditional programming approach. The program 
loop is being replaced by hardware processing ele- 
ments. This requires a whole new programming 
attitude. Programming languages suitable for 
associative processors probably will be quite dif- 
ferent from present ones. This basic uncertainty 
must be explored and some practical operating 
experience gained. As a test problem, indicative
of high data rate and real-time processing require- 
ments, the data processing functions of an air 
surveillance system (AWACS) have been chosen. 
The primary functions to be investigated are track- 
ing (both passive and active), display processing, 
and weapons control. 


The scope of the research program can be 
described with the aid of Figure 6. The flow will 
begin with the development of associative-sequen- 
tial algorithms for each of the AWACS data proc- 
essing functions. As these algorithms are being 
developed, the application engineers will make 
known to a language and system software group 
those instruction level and system routine functions 
required to support the AWACS processing func- 
tions. 


Based on this input, the language group will 
develop a language and implement this language on 
the RADCAP testbed. The system software activity 
will implement routines to support the command 
language. The applications program will then be 
run on the testbed using, where possible, nonsyn- 
thesized data as input. The machine activity will 
be monitored to gather statistics on utilization, 
identify system bottlenecks, and determine the 
efficiency with which the algorithms provide solu- 
tion. 


The data collected will then be analyzed to 
determine where cost effective improvements can 
be made to software and/or hardware in order to 
improve the cost-effective performance of the 
system. These changes will be incorporated into 
the system via micro program or software routines. 
If the change is to be a hardware design, that de- 
sign will be made to the gate level so that perfor- 
mance and cost effectiveness determination can be 
made. 


When the solution to the problem is finally 
refined, it will be contrasted with known sequential 
solutions. 
Figure 6.


Initially each of the AWACS data processing 
functions will be treated separately. The final 
task will then be to develop a system executive and 
integrate all the functions to reflect the real world. 
STARAN/RADCAP HARDWARE ARCHITECTURE


Kenneth E. Batcher 


Goodyear Aerospace Corporation 
Akron, Ohio 44315 


Summary: Hardware architecture is described 
for RADCAP, the operational associative array 
processor (AP) facility installed at Rome Air De- 
velopment Center (RADC), N.Y. Basically, this 
facility consists of a Goodyear Aerospace STARAN
parallel processor and various peripheral devices 
interfaced with a Honeywell Information Systems 
(HIS) 645 sequential computer, which runs under 
the Multics time-shared operating system. The 
hardware of STARAN/RADCAP is described with 
particular emphasis on the parallel processing 
elements. 


Introduction 


Companion papers presented at this confer- 
ence describe the potential use of the RADCAP 
facility and its software (1) (2). The STARAN as- 
sociative array (parallel) processor (3) employed 
in RADCAP has been modified to include a custom 
parallel input/output (PIO) unit and an interface 
to the 645 computer. 


The parallel processing capability of STARAN 
resides in four array modules. Each array module 
contains 256 small processing elements (PE's). 
They communicate with a multi-dimensional access 
(MDA) memory through a "flip" network, which 
can permute a set of operands to allow inter-PE 
communication. This gives the programmer a 
great deal of freedom in using the processing 
capability of the PE's. At one stage of a program, 
he may apply this capability to many bits of one or 
a few items of data; at another stage, he may apply 
it to one or a few bits of many items of data. 


The remainder of this paper deals with the 
MDA memories, the STARAN array modules, and 
the STARAN/RADCAP elements. 


Multi-Dimensional Access (MDA) Memories 


A common implementation of associative 
processing is to treat data in a bit-sequential 
manner. A small one-bit PE (processing element) 
is associated with each item or word of data in 
the store, and the set of PE's accesses the data 
store in bit-slices; a typical operation is to read 
Bit i of each data word into its associated PE or 
to write Bit i from its associated PE. 


The memory for such an associative processor 
could be a simple random-access memory with the 
data rotated 90 deg, so that it is accessed by bit- 
slices instead of by words. Unfortunately, in most 
applications, data come in and leave the processor 
as items or words instead of as bit-slices. Hence, 
rotating the data in a random-access memory
complicates data input and output.


To accommodate both bit-slice accesses for 
associative processing and word-slice accesses 
for STARAN input/output (I/O), the data are stored 
in a multi-dimensional access (MDA) memory 
(Figure 1). It has wide read and write busses for 
parallel access to a large number (256) of memory 
bits. The write-mask bus allows selective writing 
of memory bits. Memory accesses (both read and 
write accesses) are controlled by the address and 
access mode control inputs; the access mode se- 
lects a stencil pattern of 256 bits, while the address 
positions the stencil in memory. 


Figure 1. Multi-Dimensional Access Memory
(MDA memory of 65,536 bits with read/write control, address bus,
access mode bus, and 256-bit write-mask, write, and read buses.)


Figure 2. Bit-Slice and Word Access Modes


For many applications, the MDA memory is
treated as a square array of bits, 256 words with
256 bits in each word. The bit-slice access mode
(Figure 2A) is used in the associative operations
to access one bit of all words in parallel, while
the word access mode (Figure 2B) is used in the
I/O operations to access several or all bits of one
word in parallel.
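The two access modes can be illustrated with a minimal Python sketch. It models only the logical view of a square MDA memory (a list of words, each a list of bits). The function names and the list-of-lists layout are assumptions made for illustration; they say nothing about how the bipolar memory physically arranges its bits.

```python
# Hypothetical model of the logical view of an MDA memory:
# SIZE words of SIZE bits each (256 x 256 in STARAN).
SIZE = 256


def make_memory(size=SIZE):
    """Return a cleared square memory of `size` words by `size` bits."""
    return [[0] * size for _ in range(size)]


def read_word(mem, word):
    """Word access mode: all bits of one word."""
    return list(mem[word])


def read_bit_slice(mem, bit):
    """Bit-slice access mode: bit number `bit` of every word."""
    return [row[bit] for row in mem]


def write_bit_slice(mem, bit, data, mask=None):
    """Write one bit-slice. With a write mask, only words whose
    mask bit is 1 are modified (masked write)."""
    for word, row in enumerate(mem):
        if mask is None or mask[word]:
            row[bit] = data[word]


# Small 4 x 4 example: masked write of bit-slice 2.
mem = make_memory(4)
write_bit_slice(mem, 2, [1, 1, 1, 1], mask=[1, 0, 1, 0])
slice2 = read_bit_slice(mem, 2)
word0 = read_word(mem, 0)
```

The same stored bits are visible through either access mode, which is the property that lets associative operations and word-oriented I/O share one memory.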


The MDA memory structure is not limited to
a square array of 256 by 256. For example, the
data may be formatted as records with 256 8-bit
bytes in each record. Thirty-two such records
can be stored in an MDA memory and accessed
several ways. To input and output records, one
can access 32 consecutive bytes of a record in
parallel (Figure 3A). To search key fields of the data,
one can access the corresponding bytes of all
records in parallel (Figure 3B). To search a whole
record for the presence of a particular byte, one
can access a bit from each byte in parallel (Figure
3C).


The MDA memories in the RADCAP array
modules are bipolar. They exhibit read cycle times
of less than 150 nsec and write cycle times of less
than 250 nsec.


Figure 3. Accessing 256-Byte Records
(Each panel shows 32 records of 256 8-bit bytes.
A: access to 32 consecutive bytes of a record.
B: access to corresponding bytes of all records.
C: access to one bit of every byte in a record.)


STARAN Array Modules 


A STARAN array module (Figure 4) contains 
a MDA memory communicating with three 256-bit 
registers (M, X, and Y) through a flip (permutation) 
network. One may think of an array module as hav- 
ing 256 small processing elements (PE's), where a 
PE contains one bit of the M register, one bit of the 
X register, and one bit of the Y register. 


The M register drives the write mask bus of 
the MDA memory to select which of the MDA memory
bits are modified in a masked-write operation.
The MDA memory also has an unmasked-write oper- 
ation that ignores M and modifies all 256 accessed 
bits. The M register can be loaded from the other 
components of the array module. 


In general, the logic associated with the X register
can perform any of the 16 Boolean functions of
two variables; that is, if $x_i$ is the state of the ith
X-register bit, and $f_i$ is the state of the ith flip
network output, then:

\[ x_i \leftarrow \phi(x_i, f_i) \qquad (i = 0, 1, \ldots, 255) \]

where $\phi$ is any Boolean function of two variables.
Similarly, the logic associated with the Y-register
can perform any Boolean function:

\[ y_i \leftarrow \phi(y_i, f_i) \qquad (i = 0, 1, \ldots, 255) \]

where $y_i$ is the state of the ith Y-register bit. The
programmer is given the choice of operating X
alone, Y alone, or X and Y together.


If X and Y are operated together, the same
Boolean function, $\phi$, is applied to both registers:

\[ x_i \leftarrow \phi(x_i, f_i) \]
\[ y_i \leftarrow \phi(y_i, f_i) \]


The programmer also can choose to operate
on X selectively using Y as a mask:

\[ x_i \leftarrow \phi(x_i, f_i) \qquad (\text{where } y_i = 1) \]
\[ x_i \leftarrow x_i \qquad (\text{where } y_i = 0) \]


Another choice is to operate on X selectively
while operating on Y:

\[ x_i \leftarrow \phi(x_i, f_i) \qquad (\text{where } y_i = 1) \]
\[ x_i \leftarrow x_i \qquad (\text{where } y_i = 0) \]
\[ y_i \leftarrow \phi(y_i, f_i) \]


In this case, the old state of Y (before modification
by $\phi$) is used as the mask for the X operation.
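The register operations described above can be sketched in Python by holding each 256-bit register in an integer. This is an illustrative model only: the encoding of the Boolean function as a 4-bit truth table, the mode names, and the function names are assumptions, not STARAN instruction formats.

```python
WIDTH = 256
ALL_ONES = (1 << WIDTH) - 1

XOR = 0b0110  # truth table: result is 1 when exactly one input is 1


def apply_phi(table, r, f):
    """Apply a two-variable Boolean function bitwise to registers r and f.

    `table` is a 4-bit truth table (an assumed encoding): bit number
    (2 * r_i + f_i) of `table` is the result for inputs r_i and f_i.
    """
    nr, nf = ~r & ALL_ONES, ~f & ALL_ONES
    out = 0
    if table & 0b0001:
        out |= nr & nf   # r_i = 0, f_i = 0
    if table & 0b0010:
        out |= nr & f    # r_i = 0, f_i = 1
    if table & 0b0100:
        out |= r & nf    # r_i = 1, f_i = 0
    if table & 0b1000:
        out |= r & f     # r_i = 1, f_i = 1
    return out


def register_op(x, y, f, table, mode):
    """Return the new (x, y) for one operation; mode names are hypothetical."""
    new_x = apply_phi(table, x, f)
    new_y = apply_phi(table, y, f)
    if mode == "x":
        return new_x, y
    if mode == "y":
        return x, new_y
    if mode == "xy":
        return new_x, new_y
    # Selective operation on X: the old state of Y is the mask.
    masked_x = (new_x & y) | (x & ~y & ALL_ONES)
    if mode == "x_masked":
        return masked_x, y
    if mode == "x_masked_and_y":
        return masked_x, new_y
    raise ValueError(mode)


result = register_op(0b0011, 0b0101, 0b1111, XOR, "x_masked_and_y")
```

In the last mode both results are computed from the old register states before either register is updated, matching the statement that the old state of Y is used as the mask.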
Figure 4. STARAN Array Module


For a programming example, the basic loop
of an unmasked add fields operation is selected.
This operation adds the contents of a Field A of
all memory words to the contents of a Field B of
the words and stores the sum in a Field S of the
words. For n-bit fields, the operation executes
the basic loop n times. During each execution of
the loop, a bit-slice (a) of Field A is read from
memory, a bit-slice (b) of Field B is read, and a
bit-slice (s) of Field S is written into memory. The
operation starts at the least significant bits of the
fields and steps through the fields to the most
significant bits. At the beginning of each loop
execution, the carry (c) from the previous bits is
stored in Y and X contains zeroes:

\[ x_i = 0 \]
\[ y_i = c_i \]


The loop has four steps:


Step 1: Read Bit-slice a and exclusive-or ($\oplus$) it to
X selectively and also to Y:

\[ x_i \leftarrow x_i \oplus y_i a_i \]
\[ y_i \leftarrow y_i \oplus a_i \]

The states of X and Y are now:

\[ x_i = a_i c_i \]
\[ y_i = a_i \oplus c_i \]


Step 2: Read Bit-slice b and exclusive-or it to X
selectively and also to Y:

\[ x_i \leftarrow x_i \oplus y_i b_i \]
\[ y_i \leftarrow y_i \oplus b_i \]

Registers X and Y now contain the carry and sum
bits:

\[ x_i = a_i b_i \oplus a_i c_i \oplus b_i c_i = c'_i \]
\[ y_i = a_i \oplus b_i \oplus c_i = s_i \]


Step 3: Write the sum bit from Y into Bit-slice s
and also complement X selectively:

\[ x_i \leftarrow x_i \oplus y_i \]

The states of X and Y are now:

\[ x_i = c'_i \oplus s_i \]
\[ y_i = s_i \]


Step 4: Read the X-register and exclusive-or it
into both X and Y:

\[ x_i \leftarrow x_i \oplus x_i \]
\[ y_i \leftarrow y_i \oplus x_i \]


This clears X and stores the carry bit into Y to
prepare the registers for the next execution of
the loop:

\[ x_i = 0 \]
\[ y_i = c'_i \]
Step 3 takes less than 250 nsec, while Steps 1, 
2, and 4 each take less than 150 nsec. Hence, the 
time to execute the basic loop once is less than 
700 nsec. If the field length is 32 bits, the add 
operation takes less than 22.4 microsec plus a 
small amount of setup time. The operation per- 
forms 256 additions in each array module. This 
amounts to 1024 additions, if all four array modules 
are enabled, to achieve a processing power of 
approximately 40 MIPS (million instructions per
second).
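The four-step loop can be checked with a short Python simulation. It is a sketch under stated assumptions: each register is held in one Python integer with bit i standing for processing element i, the fields are passed as lists of integers (one per memory word), and the names `add_fields` and `bit_slice` are hypothetical. The carry out of the most significant bit is simply dropped, so sums are taken modulo 2^n.

```python
def bit_slice(words, k):
    """Collect bit k of every word into one integer (bit i = word i)."""
    s = 0
    for i, w in enumerate(words):
        s |= ((w >> k) & 1) << i
    return s


def add_fields(a_words, b_words, n_bits):
    """Bit-serial add of Field A and Field B over all words at once."""
    num = len(a_words)
    sums = [0] * num
    x, y = 0, 0  # X cleared; Y holds the carry (zero before the first bit)
    for k in range(n_bits):          # least significant bit first
        a = bit_slice(a_words, k)
        b = bit_slice(b_words, k)
        # Step 1: XOR a into X where the old Y is 1, and into Y.
        x, y = x ^ (y & a), y ^ a
        # Step 2: XOR b into X where the old Y is 1, and into Y.
        x, y = x ^ (y & b), y ^ b
        # Step 3: write the sum bit from Y into bit-slice k of Field S,
        # and complement X where Y is 1.
        for i in range(num):
            sums[i] |= ((y >> i) & 1) << k
        x = x ^ y
        # Step 4: XOR the old X into both registers; X clears, Y = carry.
        x, y = x ^ x, y ^ x
    return sums


totals = add_fields([3, 5, 255], [4, 7, 1], 8)
```

Every word is processed in the same pass through the loop, so the number of loop executions depends only on the field length, not on the number of words.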


The array module components communicate 
through a network called the flip network. A
selector chooses a 256-bit source item from the
MDA memory read bus, the M register, the X 
register, the Y register, or an outside source. 
The bits of the source item travel through the flip 
network, which may shift and permute the bits in 
various ways. The permuted source item is 
presented to the MDA memory write bus, M reg- 
ister, X register, Y register, and an outside 
destination. 


The permutations of the flip network allow 
inter-PE communication. A PE can read data 
from another PE either directly from its registers 
or indirectly from the MDA memory. One can per- 
mute the 256-bit data item as a whole or divide it 
into groups of 2, 4, 8, 16, 32, 64, or 128 bits and 
permute within groups. 


The permutations allowed include shifts of 
1, 2, 4, 8, 16, 32, 64, or 128 places. One also 
can mirror the bits of a group (invert the left- 
right order) while shifting it. A positive shift of 
mirrored data is equivalent to a negative shift of 
the unmirrored data. To shift data a number of 
places, multiple passes through the flip network 
may be required. Mirroring can be used to re- 
duce the number of passes. For example, a 
shift of 31 places can be done in two passes: mir- 
ror and shift 1 place on the first pass, and then 
remirror and shift 32 places on the second pass. 
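The 31-place example can be verified with a small Python sketch. It assumes end-around (cyclic) shifts over the whole 256-bit item and assumes that each pass mirrors first and then shifts; the text does not state either detail, and the helper names are hypothetical.

```python
def mirror(bits):
    """Invert the left-right order of the item."""
    return bits[::-1]


def shift(bits, k):
    """End-around shift by k places: result[i] = bits[(i - k) mod n]."""
    n = len(bits)
    k %= n
    return bits[n - k:] + bits[:n - k]


def one_pass(bits, places, mirrored):
    """One pass through the (modelled) network: optional mirror, then shift."""
    return shift(mirror(bits) if mirrored else bits, places)


data = list(range(256))  # each position carries its own index as a label

# Pass 1: mirror and shift 1 place. Pass 2: remirror and shift 32 places.
two_pass = one_pass(one_pass(data, 1, True), 32, True)
direct = shift(data, 31)
```

Under these assumptions the two mirrored passes give the same arrangement as a direct 31-place shift, because mirroring, shifting k places, and mirroring again is a shift of k places in the opposite direction.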


The flip network permutations are particularly
useful for fast Fourier transforms (FFT's). A $2^n$-point
FFT requires n steps, where each step pairs
the $2^n$ points in a certain way and operates on the
two points of each pair arithmetically to form two
new points. The flip network can be used to
rearrange the pairings between steps. Bitonic sorting
(4) and other algorithms (5) also find the
permutations of the flip network useful.
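As an illustration of pairings that change from step to step, the sketch below lists the pairs used by a conventional radix-2 FFT, where the partner of point i at a given step is i with one index bit inverted. The text does not say which pairing STARAN programs use, so this standard scheme and the function name are assumptions.

```python
def butterfly_pairs(n, step):
    """Pairs of point indices for one step of a 2**n-point radix-2 FFT.

    At step `step` (0 <= step < n), point i is paired with the point
    whose index differs from i only in bit `step`.
    """
    stride = 1 << step
    return [(i, i ^ stride) for i in range(1 << n) if not i & stride]


first = butterfly_pairs(3, 0)
last = butterfly_pairs(3, 2)
```

For an 8-point transform the first step pairs neighbouring points and the last step pairs points four places apart, so the data must be regrouped between steps.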


Each array module contains a resolver reading
the state of the Y register. One output of the
resolver (activity-or) indicates if any Y bit is set.
If some Y bits are set, the other output of the
resolver indicates the index (address) of the first
such bit. Since the result of an associative search
is marked in the Y register, the resolver indicates
which words, if any, respond to the search.
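The resolver's two outputs can be sketched as a small Python function. The function name and the use of `None` when no bit is set are assumptions; the text does not say what the index output holds in that case.

```python
def resolve(y_bits):
    """Return (activity_or, index of the first set bit or None)."""
    for index, bit in enumerate(y_bits):
        if bit:
            return True, index
    return False, None


# Words 3 and 6 responded to a search; the resolver reports word 3 first.
responders = [0, 0, 0, 1, 0, 0, 1, 0]
any_set, first_index = resolve(responders)
```

A program can step through all responders by clearing the reported Y bit and resolving again until the activity-or output is false.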


STARAN/RADCAP Elements 


Each of the four array modules in STARAN/ 
RADCAP (Figure 5) contains an assignment switch 
that connects its control inputs and data inputs and 
outputs to AP (associative processor) control or the
PIO (parallel input/output) module.


The AP control unit contains the registers and 
logic necessary to exercise control over the array 
modules assigned to it. It receives instructions 
from the control memory and can transfer 32-bit 
data items to and from the control memory. Data 
busses communicate with the assigned array mod- 
ules. The busses connect only to 32 bits of the 
256-bit-wide input and output ports of the array 
modules (Figure 4), but the permutations of the 
array module flip networks allow communication 
with any part of the array. The AP control sends 
control signals and MDA memory addresses and 
access modes to the array modules and receives 
the resolver outputs from the array modules. 


Registers in the AP control include: 


1. An instruction register to hold the 32-bit in- 
struction being executed. 


2. A program status word to hold the control 
memory address of the next instruction to be exe- 
cuted and the program priority level. 


3. A common register to hold a 32-bit search com- 
parand, an operand to be broadcast to the array 
modules, or an operand output from an array 
module. 


4. An array select register to select a subset
of the assigned array modules to be operated on. 


5. Four field pointers to hold MDA memory ad- 
dresses and allow them to be incremented or de- 
cremented for stepping through the bit-slices of 
a field, the words of a group, etc. 


6. Three counters to keep track of the number of 
executions of loops, etc. 
7. A data pointer to allow stepping through a set
of operands in control memory.


8. Two access mode registers to hold the MDA
memory access modes.


The parallel input/output (PIO) module contains
a PIO flip network and PIO control unit
(Figure 5). It is used for high bandwidth I/O and
inter-array transfers.


The PIO flip network permutes data between
eight 256-bit ports. Ports 0 through 3 connect to
the four array modules through buffer registers.
Port 7 connects to a 32-bit data bus in the PIO
control through a fan-in, fan-out switch. Ports 4, 5,
and 6 are spare ports intended for future connections
to high bandwidth peripherals, such as
parallel-head disk stores, sophisticated displays, and
radar video channels. The spare ports also could
be used to handle additional array modules. High
bandwidth inter-array data transfers up to 1024
bits in parallel are handled by permuting data
between Ports 0, 1, 2 and 3. Array I/O is handled
by permuting data between an array module port
and an I/O port. The PIO flip network is controlled
by the PIO control unit.


Figure 5. STARAN/RADCAP Block Diagram
(Labels recoverable from the diagram: sequential control memory,
HIS 645 channel, HIS 645 interface, registers, data, instr,
peripherals, PIO control.)


The PIO control unit controls the PIO flip network
and the array modules assigned to it. While
AP control is processing data in some array modules,
the PIO control can input and output data in
the other array modules. Since most of the registers
in the AP control are duplicated in PIO control,
it can address the array modules associatively.


The control memory holds AP control programs,
PIO control programs, and microprogram subroutines.
To satisfy the high instruction fetch rate
of the control units (up to 7.7 million instructions
per second), the control memory has five banks of
bipolar memory with 512 32-bit words in each bank.
Each bank is expandable to 1024 words. To allow
for storage of large programs, the control memory
also has a 16K-word core memory with a cycle time
of 1 microsecond. The core memory can be expanded
to 32K words. Usually the main program resides in
the core memory and the system microprogram subroutines
reside in bipolar storage. For flexibility,
users are given the option of changing the storage
allocation and dynamically paging parts of the program
into bipolar storage.


A Digital Equipment Corporation (DEC) PDP- 
11 minicomputer is included to handle the periph- 
erals, control the system from console commands, 
and perform diagnostic functions. It is called se- 
quential control to differentiate it from the STARAN 
parallel processing control units. The sequential 
control memory of 16K 16-bit words is augmented 
by an 8K x 16-bit "window" into the main control
memory. By moving the window, sequential con- 
trol can access any part of control memory. The 
window is moved by changing the contents of an 
addressable register. 
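
As a rough illustration of the window mechanism, the following Python sketch models control memory as a flat list of 16-bit values and lets a base register choose which 8K span is visible to sequential control. The class name, the `window_base` attribute, and the 16-bit-granular addressing are assumptions made for illustration; the text does not say how the window register is encoded.

```python
class ControlMemoryWindow:
    """Toy model of sequential control's movable window into control memory."""

    WINDOW_SIZE = 8 * 1024  # 8K 16-bit words are visible at any one time

    def __init__(self, control_memory):
        # control_memory: list of 16-bit values (assumed layout, not from the paper)
        self.memory = control_memory
        self.window_base = 0  # stands in for the addressable window register

    def move_window(self, base):
        """Reposition the window by rewriting the base register."""
        if not 0 <= base <= len(self.memory) - self.WINDOW_SIZE:
            raise ValueError("window would fall outside control memory")
        self.window_base = base

    def read(self, offset):
        """Read the 16-bit word at `offset` within the current window."""
        if not 0 <= offset < self.WINDOW_SIZE:
            raise IndexError("offset outside the 8K window")
        return self.memory[self.window_base + offset]

    def write(self, offset, value):
        """Write a 16-bit word at `offset` within the current window."""
        if not 0 <= offset < self.WINDOW_SIZE:
            raise IndexError("offset outside the 8K window")
        self.memory[self.window_base + offset] = value & 0xFFFF
```

In this model, reaching a location outside the current span takes one call to `move_window` followed by an ordinary `read` or `write`.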


The STARAN/RADCAP peripherals include a 
disk, card reader, line printer, paper-tape reader/ 
punch, console typewriter, and a graphics console. 


Synchronization of the three control units (AP 
control, sequential control, and PIO control) is 
maintained by the external function (EXF) logic. 
Control units issue commands to the EXF logic to 
cause system actions and read system states. Some 
of the system actions are: AP control start/stop/ 
reset, PIO control start/stop/reset, AP control 
interrupts, sequential control interrupts, and array
module assignment.


RADCAP connects to a common peripheral
channel of a 645 computer. Channel characters
are 6 bits wide. Instead of interfacing the channel
to one of the three control units in RADCAP, the
channel interface is assigned a set of control memory
addresses so it can be addressed by any control
unit. The interface has a 30-character first-in,
first-out (FIFO) queue to buffer the data transfer
between the two machines. To reduce the number
of queue accesses, the control units transfer queue
data by character-pairs, 12 bits at a time.
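
The queue behavior described above can be sketched in Python as follows. This is a minimal model, assuming that the first character of a pair occupies the high-order 6 bits of the 12-bit value; the text does not state the packing order, and the class and method names are hypothetical.

```python
from collections import deque


class ChannelFifo:
    """Toy model of the 30-character FIFO between RADCAP and the 645 channel."""

    CAPACITY = 30      # characters, as stated in the text
    CHAR_MASK = 0o77   # channel characters are 6 bits wide

    def __init__(self):
        self._chars = deque()

    def put_char(self, char):
        """Queue one 6-bit character."""
        if len(self._chars) >= self.CAPACITY:
            raise OverflowError("queue full")
        self._chars.append(char & self.CHAR_MASK)

    def get_pair(self):
        """Remove two characters and return them packed as one 12-bit value."""
        if len(self._chars) < 2:
            raise LookupError("fewer than two characters queued")
        first = self._chars.popleft()
        second = self._chars.popleft()
        return (first << 6) | second  # first character in the high bits (assumed)

    def put_pair(self, pair):
        """Split a 12-bit value into two characters and queue both."""
        if len(self._chars) > self.CAPACITY - 2:
            raise OverflowError("no room for a character pair")
        self.put_char((pair >> 6) & self.CHAR_MASK)
        self.put_char(pair & self.CHAR_MASK)
```

Moving characters in pairs halves the number of queue accesses a control unit needs to drain or fill the queue.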


STARAN/RADCAP SYSTEM SOFTWARE


Edward W. Davis 
Goodyear Aerospace Corporation 
Akron, Ohio 44315 


Summary: System software is described for RAD- 
CAP, the operational associative array processor 
(AP) facility installed at Rome Air Development 
Center (RADC), N.Y. The description covers the 
software for the stand-alone operation of the Goodyear
Aerospace STARAN (a) associative array
(parallel) processor, which is supported by a disk 
operating system with a macro-assembler, a 
relocating linker and loader, an interactive debug 
package, and control procedures. Also described 
is the software for the STARAN processor when 
integrated with the Honeywell Information Systems 
(HIS) 645 sequential computer, which runs under 
the Multics time-shared operating system. 


Introduction 


The potential use of RADCAP and its hard- 
ware architecture are described in companion 
papers presented at this conference (1) (2). Basi- 
cally, the RADCAP facility consists of an opera- 
tional STARAN associative array (parallel) proc- 
essor (2) (3) and various peripheral devices, all 
interfaced with a 645 computer. 


There are two modes of RADCAP operation. 
First, STARAN can be operated as a stand-alone 
parallel processing system. Peripherals for this 
mode include a card reader, line printer, paper 
tape reader and punch, and cartridge type disk 
unit. Second, STARAN and the 645 can be oper- 
ated in an integrated fashion. This means that (1) 
commands to the STARAN disk operating system 
can originate in Multics, (2) the Multics storage 
system is available to STARAN users for program 
or data storage, and (3) a single task can use both 
machines to satisfy its processing requirements. 
All peripherals belonging to a stand-alone 
STARAN and to the HIS 645 are available when the 
machines are integrated. 


This paper describes the software for the
STARAN stand-alone mode of operation, then 
covers the additional software used with the inte- 
grated mode. 


Since the STARAN processor architecture is 
detailed in a companion paper (2), only a basic diagram
is given in Figure 1. The multi-dimensional
access associative arrays and their controls are 
the main architectural features. The sequential 
control, a Digital Equipment Corporation (DEC) 
PDP-11 minicomputer, has a minor role in the 
architecture, but is important for software con- 
siderations. Other architectural features are 
mentioned later in the paper. 


(a) TM, Goodyear Aerospace Corporation,
Akron, Ohio.


Software For STARAN Stand-Alone Mode


Software for the STARAN stand-alone mode 
of operation can be discussed from the standpoints 
of the operating system, language processing, and 
operational software. 


Batch Disk Operating System 


In this paper, an operating system means the 
collection of routines that give the user appropri- 
ate control of the computing system, inform him 
of system status, provide input/output (I/O) facilities,
and provide access to system programs.
STARAN features a disk operating system (DOS) 
and has a batch processing capability. The batch 
command stream can be assigned to any ASCII 
character input device, allowing control to origi- 
nate at the control console or from a user's file 
on the batch device.


Figure 1. STARAN Block Diagram


The disk is a file-structured bulk storage
medium. All system software is resident on the 
device for easy, rapid access by the user. 


Listed below are the standard programs 
supplied with the DEC PDP-11 batch system: 


Program Name    Function

MACRO           Macro-assembler
LINK            Linker
LIBR            Librarian
PIP             File utility package
EDIT            Text editor
ODT             On-line debugging package
FORTRAN         Fortran compiler


These programs are not covered in detail since 
primary emphasis in this paper is on the STARAN- 
related software that has been added to the above 
list to build the STARAN disk operating system. 


One general rule used in software develop- 
ment was to avoid changes to the basic DEC batch 
system. This rule was intended to simplify any 
future change to a new DEC release. 


Language Processing


APPLE. Programs for STARAN are written 
in the APPLE assembly language (Associative 
Processor Programming Language). This lan- 
guage has some mnemonics that generate one 
machine language instruction and others that gen- 
erate a sequence of machine instructions (5). The 
one-to-many mnemonics generally implement a 
parallel algorithm for arithmetic or search operations
using the arrays. Thus, APPLE is at a
higher level than sequential machine assembly
languages.


APPLE produces relocatable or absolute 
program sections and has a conditional assembly 
capability. Groups of instructions in the language 
are listed below: 


1. Assembler directives

2. Branch instructions

3. Register load and store

4. Associative instructions

   a. Loads

   b. Stores

   c. Parallel searches

   d. Parallel moves

   e. Parallel arithmetic operations

5. Control and test instructions

6. Input/output (I/O) instructions

Most of these groups of instructions resemble 
those of other typical assemblers. The unique 
group - associative instructions - deals with oper- 
ations on the multi-dimensional access arrays and 
the registers in their processing elements (PE). 


(b) TM, Goodyear Aerospace Corporation,
Akron, Ohio.


Some general comments apply to all the associa- 
tive instructions listed above. Operations take 
place only on arrays enabled by the array select 
register (2). Fields are of variable length within 
each array word and are defined for various in- 
structions by field pointers and length counters. 
The common register, a part of associative con- 
trol, can contain an operand which is used in com- 
mon by all selected array words. 


More detail is presented below on the associ- 
ative instructions; i.e., loads, stores, parallel 
searches, parallel moves, and parallel arithmetic 
operations. 


The load associative instructions load the 
processing element (PE) registers or the common 
register with data from the arrays. Logical oper- 
ations may be performed between the current PE 
register contents and the array data. The language 
has mnemonics for the common logical operations, 
while the machine supports all 16 functions of two 
logical variables. A given load instruction can 
increment, decrement, or leave as is an array 
field pointer. Thus, a single one of these instruc- 
tions can load registers, perform logic, and change 
pointer values. Operations to set, clear, or ro- 
tate the PE register are included in this group. 
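
The statement that the machine supports all 16 functions of two logical variables can be illustrated with a small Python sketch. It selects a function by a 4-bit truth-table code and applies it bitwise to a PE register value and an array value. The encoding of the code (bit `2*a + b` gives the output for inputs `a`, `b`) is an assumption for this sketch and is not taken from the APPLE or STARAN instruction format.

```python
def logic_function(code, pe_bits, array_bits, width):
    """Apply one of the 16 two-variable Boolean functions bitwise.

    code       -- 4-bit truth table; bit (2*a + b) is the output for inputs a, b
                  (assumed encoding, for illustration only)
    pe_bits    -- current PE register contents, one bit per processing element
    array_bits -- bit slice read from the array
    width      -- number of processing elements modeled
    """
    mask = (1 << width) - 1
    not_pe = ~pe_bits & mask
    not_array = ~array_bits & mask
    # One minterm per input combination, indexed by 2*a + b.
    minterms = [
        not_pe & not_array,   # a=0, b=0
        not_pe & array_bits,  # a=0, b=1
        pe_bits & not_array,  # a=1, b=0
        pe_bits & array_bits, # a=1, b=1
    ]
    result = 0
    for index, minterm in enumerate(minterms):
        if (code >> index) & 1:
            result |= minterm
    return result


AND, OR, XOR = 0b1000, 0b1110, 0b0110  # three of the 16 codes under this encoding
```

Under this encoding, every value of `code` from 0 to 15 yields a distinct function, which is one way to see that exactly 16 functions exist.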


The store associative instructions are used to 
move PE or common register data into the arrays. 
A mask feature is provided that allows writing 
only in mask enabled array words. As with the 
load instructions, logical operations may be per- 
formed between the current PE register contents 
and the array data. Also, the array field pointer 
can be incremented, decremented, or left un- 
changed. 
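
A masked store can be sketched in Python as below. The sketch treats each array word as an integer and writes a common-register value into a field of every mask-enabled word. The function name and the word-at-a-time view are assumptions for illustration; the text does not give the machine-level sequence.

```python
def masked_store(words, mask, start, length, value):
    """Write `value` into bits [start, start+length) of each enabled word.

    words -- list of array words (integers), modified in place
    mask  -- list of booleans, one per word; only True entries are written
    """
    field_mask = ((1 << length) - 1) << start
    for index, enabled in enumerate(mask):
        if enabled:
            cleared = words[index] & ~field_mask
            words[index] = cleared | ((value << start) & field_mask)
    return words
```

Words whose mask entry is false keep their previous contents, which is the effect the mask feature is described as providing.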


The parallel search associative instructions 
allow the programmer to search for particular 
conditions in the arrays. Only those words enabled 
by the mask register take part in the searches. 
Searches can be performed that compare a value 
in the common register with a value in a field of 
all array words. Another variety of search com- 
pares one field of a word with a second field of the 
same word for all array words. Comparisons can 
be made for such conditions as equal, not equal, 
greater than, greater than or equal, etc. Maxi- 
mum and minimum searches also can be perform- 
ed. Combinations of searches yield such functions 
as between limits and next higher. Additional 
mnemonics in this group are provided to resolve 
multiple responders to the searches. 
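
The search behavior described above can be sketched in Python as follows. The sketch compares a field of every mask-enabled word with a common-register comparand, builds a between-limits search from two simpler searches, and resolves multiple responders by picking the first. All function names are hypothetical, and choosing the lowest-numbered responder is an assumed resolution rule.

```python
import operator


def get_field(word, start, length):
    """Extract bits [start, start+length) of a word."""
    return (word >> start) & ((1 << length) - 1)


def search(words, mask, start, length, comparand, relation):
    """Return one responder flag per word; masked-out words never respond."""
    return [
        bool(enabled and relation(get_field(word, start, length), comparand))
        for word, enabled in zip(words, mask)
    ]


def between_limits(words, mask, start, length, low, high):
    """Combine two searches: field >= low, then field <= high among survivors."""
    at_least_low = search(words, mask, start, length, low, operator.ge)
    return search(words, at_least_low, start, length, high, operator.le)


def first_responder(responders):
    """Resolve multiple responders by selecting the lowest word index (assumed)."""
    for index, hit in enumerate(responders):
        if hit:
            return index
    return None
```

Feeding the responders of one search in as the mask of the next is how the combination searches are modeled here.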


The parallel move instructions are provided 
to move an array memory field to another field 
within the same array word. As with searches, a 
word is active for this instruction only when ena- 
bled by the mask register. Types of moves are 
direct, complement the field, increment or decre- 
ment the field, and move the absolute value. 


The parallel arithmetic associative instructions
allow the programmer to perform arithmetic
operations in parallel in the arrays. These operations
are subject to mask register word enabling.
Arithmetic can use a value in the common register 
as one operand and a value in a field of all array 
words as the parallel operand. Alternatively, one 
field of a word can be arithmetically combined 
with a second field of the same word for all array 
words. Operations supplied by APPLE are add, 
subtract, multiply, divide, and square root. 
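
To make the common-register form of parallel arithmetic concrete, the Python sketch below adds one operand to a field of every enabled word, one bit slice at a time with a carry bit per word. The bit-serial formulation is an assumption chosen to match the bit-slice stepping mentioned for the field pointers; the text does not give the actual APPLE instruction sequence, and the function name is hypothetical.

```python
def parallel_add_common(words, mask, start, length, operand):
    """Add `operand` to bits [start, start+length) of every enabled word.

    The outer loop steps through bit slices, least significant first. The
    inner loop stands in for what the array would do to all words at once.
    Returns the final carry out of the field for each word.
    """
    carry = [0] * len(words)
    for bit in range(length):
        position = start + bit
        operand_bit = (operand >> bit) & 1  # same bit broadcast to every word
        for index, enabled in enumerate(mask):
            if not enabled:
                continue
            word_bit = (words[index] >> position) & 1
            sum_bit = word_bit ^ operand_bit ^ carry[index]
            carry[index] = (
                (word_bit & operand_bit)
                | (word_bit & carry[index])
                | (operand_bit & carry[index])
            )
            words[index] = (words[index] & ~(1 << position)) | (sum_bit << position)
    return carry
```

The number of slice steps depends only on the field length, not on the number of words, which is the property that makes this style of arithmetic attractive on an array machine.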


Macro. A macro language is provided to in- 
crease the user's flexibility at assembly time (6). 
The macro language has a large set of arithmetic, 
logical, relational, and string manipulation opera- 
tors. Adding macro variable symbol handling, 
conditional expansion capability, and ability to nest 
macro calls make it possible to write powerful 
macro instructions. A system macro library 
feature has been implemented. 


Benefits to the user are the ability to define 
new mnemonics, redefine existing mnemonics, and 
conveniently generate standard instruction sequences. 


Mnemonics have been added to the basic 
APPLE language for RADCAP by writing macros 
and putting them in the system library. Primarily, 
the added mnemonics are floating point instructions. 
They are fixed field length operations in both single 
and double precision. 


Building Load Modules. Software used to con- 
vert source language programs into executable 
load modules includes an APPLE assembler, 
macro-preprocessor, and relocating linker. Fig- 
ure 2 shows this software and the flow of programs 
or modules through it. 


Building load modules begins with the original 
program written in APPLE. This source program 
may contain macro instructions. The source is
translated into a machine language object module
by MAPPLE (the APPLE assembler with a macro
preprocessor on the front end). If it is known that
the source program does not contain macro instructions,
it is possible to input the source directly to
the APPLE assembler.


Figure 2.


A relocatable object module is converted to
an absolute load module by the STARAN linker. 
Multiple object modules may be input to the linker 
since it has the function of resolving symbols de- 
fined across object module boundaries (global 
symbols) as well as adjusting addresses for relo- 
cation. 
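
The two linker functions named above, resolving global symbols and adjusting addresses for relocation, can be sketched in Python with a conventional two-pass scheme. The object module layout used here (`words`, `defs`, `refs`, `relocs`) is a hypothetical format invented for illustration and is not the STARAN object format.

```python
def link(modules, base=0):
    """Combine relocatable modules into one absolute image.

    Each module is a dict (hypothetical format):
      words  -- list of machine words, addresses relative to the module start
      defs   -- {global name: offset within this module}
      refs   -- {word index: global name whose address belongs in that word}
      relocs -- list of word indexes holding module-relative addresses
    """
    # Pass 1: assign an origin to each module and collect global definitions.
    globals_table = {}
    origins = []
    origin = base
    for module in modules:
        origins.append(origin)
        for name, offset in module["defs"].items():
            if name in globals_table:
                raise ValueError("global symbol defined twice: " + name)
            globals_table[name] = origin + offset
        origin += len(module["words"])

    # Pass 2: relocate local addresses and fill in global references.
    image = []
    for module, module_origin in zip(modules, origins):
        words = list(module["words"])
        for index in module["relocs"]:
            words[index] += module_origin
        for index, name in module["refs"].items():
            if name not in globals_table:
                raise KeyError("undefined global symbol: " + name)
            words[index] = globals_table[name]
        image.extend(words)
    return image, globals_table
```

The first pass is needed because a module may refer to a global symbol that is defined in a module appearing later in the input.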


Use of the language processing software is 
fully described in the STARAN user's guide (7). 


Operational Software 


Operational software is discussed below from 
the standpoints of loading, executing, and debugging
programs on STARAN. Four modules are involved: the
loader, the STARAN program supervisor, the debug
module, and the control module.


Loader. Output of the STARAN linker is shown 
in Figure 2 as an absolute load module. The loader 
has the straightforward task of moving a load mod- 
ule into STARAN control memory beginning at the 
address specified in a text block. Options on load- 
ing are to load and not execute or to load and begin 
execution either at an address given with the load 
module or at one given with the load command. 


The loader can be linked with a user program
to enable calling for a load from an executing
program. This means that overlay modules can be 
brought in dynamically. 


STARAN Program Supervisor (SPS). The SPS
is the software interface between the associative
and sequential portions of STARAN. This module 
has services for system users when programming 
in APPLE and when programming a PDP-11 rou- 
tine to interact with an APPLE program. 


For the APPLE program, SPS makes the I/O 
instructions of the disk operating system (DOS) 
available, provides a program overlay capability, 
and provides a programmable interrupt to a PDP-11
routine. The PDP-11 routine interacts through
a software link, which receives the APPLE interrupts,
and through the issuing of control information
to the associative control logic.


In addition, the SPS supplies interface ser- 
vices. It transfers data between associative and 
sequential memory through the common memory 
window (Figure 1). The SPS also fields associative 
processor error interrupts. 


Concurrent execution of associative and sequen- 
tial routines, with interaction, is made possible by 
the SPS. 


STARAN Debug Module (SDM). The SDM helps get
rid of bugs in APPLE programs by giving the user
control of the execution of the program being de- 
bugged, and access to memory and registers. Such 
features as single step, trace, and breakpoint pro- 
vide good execution control. Dumps of all memory 
areas can be taken, with both word slice and bit 
slice available for the multi-dimensional access 
arrays. All memory locations also can be modified. 


STARAN Control Module (SCM). This final operational
module is the interface between the user and
execution of a STARAN program. By running the
SCM, the user enters a mode in which STARAN 
related commands are recognized. Such commands 
as start, halt, and continue execution are processed 
directly by the SCM. When the load command is 
used, the SCM passes control to the loader for that 
function. If debug aids are needed, a simple com- 
mand adds all debug module features to the SCM. 


All the operational software modules are described
more fully in the STARAN user's guide (7).


Software for STARAN/645 Multics Mode 


General 


In the RADCAP facility, the integrated use of 
the STARAN parallel processor and the 645 sequen- 
tial computer makes additional software necessary. 
One major concern is the interface between the 
computers; this requires a software module in both 
machines. A second concern involves reasonable 
ease of use for the integrated mode; four procedure 
packages that execute totally in the 645 were added 
to satisfy this concern. 


Figure 3 shows the relationship between soft- 
ware modules in STARAN and the 645. As indica- 
ted, the Multics time-shared operating system of 
the 645 contains three categories of software: 
command level, user process, and system related. 
Command level software is brought into execution 
by user-supplied commands, as from a Multics 
terminal. User process software consists essen- 
tially of subroutines called from a user program. 
System-related software is the collection of rou- 
tines that support use of the system, such as 
handling input and output, and are usually called 
indirectly by the user program. 


Additional details on the design and use of software
are described in the STARAN/645 user's guide (8).


Interface Modules


The two modules for the interface, shown in 
Figure 3, are the 645 device driver in the STARAN 
batch disk operating system (DOS) and the STARAN 
device interface module (DIM). These modules 
are discussed below. 


645 Device Driver. This driver provides the 
interface between the DOS monitor and the 645 
computer. It communicates with the monitor as 
do other device drivers for standard peripherals. 
If the device looks like an input for character in- 
formation, then batch commands can come from it. 
The batch stream can be assigned to the device. 
This is the significance, for Multics, of the batch 
feature on the DOS. 


In reality, the device treated by the 645 driver 
is used for much more than character input. The 
645 appears as three logical devices with unit numbers
0, 1, and 2.


Unit 0 looks like the disk, logically. Before 
transferring data, it is necessary to "open" a file 
using a file name and extension in the DOS format. 
The driver supports both ASCII and binary transfer
modes, both formatted and unformatted. A
data-set remains open until a "close" call is issued.
At any one time, up to 14 data-sets may be open
on unit 0.


Unit 1 looks like a card reader, logically. It 
is a read-only device with an ASCII transfer mode. 
This unit serves as the batch command stream in- 
put so a Multics user can control the system. 


Unit 2 looks like a paper tape punch, logically. 
It is a write-only device with ASCII and binary 
transfer modes. Job log output, in the integrated 
mode, is always assigned to this unit. 


Because of the nature of the 645 device and its 
expected usage, the device driver has two custom 
functions built in. An "idle" function is used to
tell Multics when the command stream file has been 
processed. A "detach" function, called when a 
Multics user detaches from STARAN, performs 
cleanup and makes STARAN ready for a new Multics 
user. 


STARAN DIM. In Multics terminology, a
device interface module (DIM) coordinates communications
with a particular physical device.
The four major functions performed by the
DIM are: (1) attachment, (2) read command
from STARAN, (3) respond to STARAN command,
and (4) detachment.


Attachment is the function through which a 
user process gains access to STARAN. The interface
is initialized by a call to the attachment entry
point in the DIM. STARAN is available as a Multics
system resource to only one process at a time.
Therefore, further calls to the DIM can be made
only after a successful attachment.


The function of reading a command from 
STARAN occurs after attachment. The first com- 
mand should be to read ASCII on the command 
stream input device. As noted above, Unit 1 of the 
645 device driver is provided for this purpose. 


Once the initial sequence is past, the DIM must 
respond to STARAN commands. The call made by 
the Multics process is determined by the previous 
STARAN command. For example, if STARAN 
issues a read call, Multics must write. 


Finally, the detachment function severs the 
link between the user process and STARAN. 


Data manipulation by the DIM assumes all 
Multics data is in character form. It converts 
characters into the form needed for output to 
STARAN and converts data received from STARAN 
into Multics character form. This means, for
example, that Multics arithmetic data must be converted
to a character form prior to output, and
from characters following input. The conversion 
is done by a procedure superior to the DIM. The 
DIM also handles retransmission of bad data and 
reports a failure to its caller after a specific num- 
ber of unsuccessful tries on the same data. 
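
The retransmission rule can be sketched in Python as a bounded retry loop. The limit of three tries is an assumed value, since the text says only "a specific number", and the `transmit` callback is a hypothetical stand-in for whatever routine actually moves a block across the interface.

```python
def send_with_retry(transmit, block, max_tries=3):
    """Send `block`, retransmitting on failure up to `max_tries` times.

    transmit  -- callable returning True when the block arrived intact
                 (hypothetical stand-in for the real transfer routine)
    max_tries -- assumed limit; the text does not give the actual number
    Returns the number of tries used, or raises IOError to report the
    failure to the caller.
    """
    for attempt in range(1, max_tries + 1):
        if transmit(block):
            return attempt
    raise IOError("transfer failed after %d tries" % max_tries)
```

Raising an error after the last try corresponds to the DIM reporting the failure to its caller instead of retrying indefinitely.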


In the Multics software structure, the DIM is 
located in a position inferior to the file control 
procedures, shown in Figure 3 and described in 
the next part of this paper. 


System Use Modules


File Control Procedure (FCP). The FCP
greatly simplifies operation of STARAN from
Multics. It enables a Multics user process (program)
to interact with STARAN by initializing the
interface, handling communication between the
machines, and terminating the interface. The
FCP also makes the necessary calls to the DIM to
initialize and terminate the interface. Communication
is described in the following paragraphs.


Once the interface has been established, Multics
appears to STARAN as a set of three logical
devices, defined above as Units 0, 1, and 2. 


Unit 0 is like a disk. All operations on this 
file-structured device are initiated within STARAN 
by I/O instructions and are performed within Mul- 
tics. The FCP represents the interface between 
the DIM and Multics storage for all file operations. 
It handles the opening and closing of files, makes 
file names known to Multics, and issues appro- 
priate calls to the DIM for read and write opera- 
tions. 


Unit 1 is like a card reader. It is the source
of batch stream commands to the STARAN opera- 
ting system. The FCP must recognize requests 
for these commands, read the commands from the 
source in Multics, and write them to STARAN. 
The source can be either a Multics terminal or 
named file. All calls to the DIM are made by the 
FCP. 


Unit 2 is the destination of job log output. The 
FCP sorts this out and directs it to a Multics ter- 
minal or named file. Again, all calls to the DIM 
are handled by the FCP. 


With FCP, a user process, executing in the 
645, can call for STARAN, and it can pass commands,
programs, and data to STARAN. The FCP
raises the point at which the user becomes involved 
from sequences of calls to the DIM to a more sym- 
bolic call to FCP routines from the user process. 


STARAN Command. User involvement in the
interface to STARAN is raised still higher from
the user process to the Multics command level by
a "STARAN" module. Essentially, this module is
a supplied user process that passes parameters 
used in the terminal command to the FCP. The 
parameters identify the STARAN batch command 
stream input and output devices. The module calls 
appropriate FCP routines to establish interaction 
with STARAN. 


In typical operation of STARAN from a ter- 
minal, this Multics command is used with STARAN 
commands also coming from the terminal. Ini- 
tializing and terminating the interface are not a 
concern of the user. The Multics terminal be- 
comes very similar to the STARAN control console 
when this module is used. 


Arithmetic Format Routines (AFR). STARAN
and the 645 differ in the lengths of their data
representations. STARAN has a 32-bit control
memory, while the 645 has a 36-bit word length.
Arithmetic format routines are provided to convert
either integer or floating point data between
the 645 format and the format used by the DIM for
transmission to STARAN.


In the Multics to STARAN direction, integer 
data are converted by truncating the most signifi- 
cant four bits. A check is made to verify that the 
integer can be represented in 32 bits. Floating 
point data are converted by truncating the least 
significant bits of the mantissa. 


From STARAN to Multics, integer conversion 
is done by extending the sign bit. Floating point 
conversion is done by filling the low order mantissa 
bits with zeros. 
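
The integer conversions in both directions can be sketched in Python as below. The sketch assumes two's complement integers on both machines and holds each word as a non-negative bit pattern; the function names are hypothetical. Floating point is left out because the text does not give either machine's floating point layout.

```python
MASK36 = (1 << 36) - 1
MASK32 = (1 << 32) - 1


def multics_to_staran_int(word36):
    """Drop the four most significant bits of a 36-bit integer.

    The value fits in 32 bits only if the four dropped bits and the new sign
    bit (bits 35 through 31) are all equal, so that is the check made here.
    Two's complement representation is assumed.
    """
    top_bits = (word36 & MASK36) >> 31
    if top_bits not in (0b00000, 0b11111):
        raise OverflowError("value cannot be represented in 32 bits")
    return word36 & MASK32


def staran_to_multics_int(word32):
    """Extend the sign bit of a 32-bit integer to fill a 36-bit word."""
    word32 &= MASK32
    if word32 & (1 << 31):
        return word32 | (0xF << 32)
    return word32
```

A value that passes the check survives a round trip unchanged, for example the 36-bit pattern for -1 becomes `0xFFFFFFFF` and then returns to 36 one bits.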


Cross Assembler. This is a functionally 
equivalent version of the MAPPLE assembler, 
written in PL/1, to be run in Multics. It is avail- 
able to terminal users on a time-shared basis. It 
accepts APPLE and macro statements and pro- 
duces STARAN object code in the Multics character 
format required by the DIM for transmission to 
STARAN. 


Conclusion 


A brief description has been given of the software
that makes up the operating system for the operational
STARAN associative array processor installed
in the RADCAP facility. Also described is
the additional software that makes STARAN operational
when integrated with the 645 sequential computer.
The goal of all the software is to provide tools to
use STARAN in the stand-alone and integrated modes.
The tools are intended to increase convenience for
the user and improve total system throughput.


Many modules have been discussed. Some of 
these are essentially transparent to the user, 
some may not be needed by certain users, and 
some may be required by all users. For stand- 
alone STARAN operation, the programmer must 
know APPLE and the use of the assembler and 
linker. He must be able to run the control module 
and load programs. He will probably be interested 
in the debug module. The STARAN program supervisor
is transparent for most users. It is not
necessary to know any of the sequential control
programs or languages.


To use STARAN from a Multics terminal, the 
only additional requirement is to know how to 
connect STARAN and Multics using the Multics 
"STARAN" command. If the user wishes to have 
a Multics user process (i.e., a program) interact 
with STARAN, then the calls to the file control 
procedures and use of the arithmetic format 
routines become important. The 645 device driver 
and the STARAN DIM are transparent to users. 
The cross assembler is a convenience for Multics 
users and may be used instead of the assembler 
in STARAN.


APPLICATION OF STARAN TO SUPPORT REGION ANALYSIS
FOR A MECHANICAL ROBOT 


J. M. Plante and D. J. Gondek 
Rome Air Development Center (IRDA) 
Griffiss Air Force Base, New York 13441 


Summary 


For the past seven years the Advanced 
Research Projects Agency (ARPA) has sponsored 
research at Stanford Research Institute (SRI) in 
the area of artificial intelligence. The primary 
goal of this project has been to investigate 
techniques in artificial intelligence applied to 
the control of a mobile automaton (robot) in a 
real environment. The main emphasis has been on 
the design of a hierarchy of algorithms that will 
accept visual and other sensory information 
gathered by the automaton. Specifically,
algorithms are developed to support the analysis 
of the controlled environment in which the auto- 
maton resides (1). The potential application of 
STARAN to support a selected subset of these 
algorithms (i.e. Region Analysis) was investigated 
and is summarized in this paper. 


The Region Analysis algorithm uses a decision 
tree. Nodes in the tree correspond to an operator 
to be applied, and branches emanating from a node 
correspond to the results of that operation. Any 
path through the decision tree eventually leads 
to a terminal node corresponding to a description 
of the location, and possibly the identification 
of an object in the scene. Repeated passes 
through the tree produce a list of such informa- 
tion describing the scene. 
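
The decision-tree structure described above can be sketched in Python as follows. Each internal node holds an operator and one branch per possible result, and each terminal node holds a description. The operators, thresholds, and descriptions in the sample tree are invented for illustration and are not taken from the SRI algorithms.

```python
def analyze(node, region):
    """Follow one path through the tree and return the terminal description."""
    while "operator" in node:
        result = node["operator"](region)   # apply this node's operator
        node = node["branches"][result]     # take the branch for that result
    return node["description"]


# Hypothetical tree: the tests and labels are placeholders for illustration.
SAMPLE_TREE = {
    "operator": lambda region: region["area"] > 100,
    "branches": {
        False: {"description": "small region, unidentified"},
        True: {
            "operator": lambda region: region["brightness"] > 0.5,
            "branches": {
                True: {"description": "large bright region, possible wall"},
                False: {"description": "large dark region, possible floor"},
            },
        },
    },
}


def describe_scene(tree, regions):
    """Make one pass through the tree per region and collect the results."""
    return [analyze(tree, region) for region in regions]
```

Repeated passes, one per region, give the list of descriptions that the text says characterizes the scene.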


The Region Analysis algorithm is designed to: 


(1) Assign region numbers and identify 
related neighbors within the overall environment. 


(2) Assign scores for "Best Guess Region
Type". This information is derived from the aforementioned
Scan and Merging Heuristic Algorithms.


(3) Identify objects within regions,
as related to the overall environment.


The data (identified regions) is then used as 
input for further Scene Analysis before being 
passed to the main body of robot programs (i.e. 
question-answering, navigation/route plotting, 
problem solving, etc.).


The Scene Analysis program, as executed on a
sequential computer, uses a number of special-purpose
subroutines to extract evidence from or
to apply to a picture/scene. These low-level
routines operate in a quasi-intelligent fashion,
in that they perform some operation and
return an answer based on previous results and
the sensibleness of their answers.


The highly iterative Region Analysis algorithms
(which are a subset of the Scene Analysis
algorithms) have not yet been implemented
on any conventional sequential machine due to
the excessive computational time required to
execute them. Since this particular subtask performs
many repetitive sequential operations
which collect very similar samples/packets of 
related data elements, parallel processing tech- 
niques for performing the Region Analysis 
functions were investigated. The conclusion of 
the study was that the application of the STARAN 
Associative Processor is a viable solution which 
readily lends itself to this programming task. 


For information concerning the design and 
operation of the mechanical robot and supporting 
programming subtasks, consult references (1) and (2).


For supporting technical data on the STARAN 
Associative Processor System, the reader is directed
to reference (3).


A DATA MANAGEMENT SYSTEM
UTILIZING THE STARAN 
ASSOCIATIVE PROCESSOR 


Richard Moulder 
Digital systems, D/472 
Goodyear Aerospace Corporation 
Akron, Ohio 44315 


SUMMARY 


An on-line data base management system (DBMS) 
utilizing the STARAN Associative Processor has 
been designed and implemented at Goodyear 
Aerospace. The hardware configuration is composed 
of Goodyear's STARAN S-1000 with a parallel head- 
per-track disc (PHD) and a Xerox Data Systems 
Sigma 5 computer. Communication between the two
computers is via Direct Memory Access (DMA). The
PHD is for peripheral data storage and consists
of a single disc with 64 tracks. Each track has
a head and read/write electronics. This design
allows data to be read into or out of the associa- 
tive arrays over a communications channel which is 
64 bits wide. 


A four-level hierarchical data base was selected
and implemented in our DBMS. The technique used
for actually storing the data on the PHD was the
Associative Normal Form (ANF) suggested by
DeFiore and others [1]. Employing ANF, we
developed a data base having no external indices
and no organization by record type. This allowed 
a significant saving in peripheral storage with 
little or no degradation in query or update 
response times. This was made possible because 
of the parallel input/output and parallel content 
searching capabilities of STARAN. The benefits 
of a fully inverted data base were achieved with- 
out the attendant increase in peripheral storage. 


The software system was composed of four basic 
modules. These modules can be found in most DBMS 
and are the Define, Create, Interrogate, and 
Update Modules. The Define module describes the 
logical data structure to the computer system. 
In our implementation, the Define module was 
similar to IBM's GIS/2 [2]. The Create module 
populates the data base by mapping the logical 
data structure to the ANF and writing the data 
to the PHD. The most used modules are the 
Interrogate and Update modules. These modules 
are used via a graphic display console to query 
and change the data base. A non-procedural 
language tailored after SDC's SACCS Data Manage- 
ment System [3] was employed. Any data item 
or attribute of a record can participate in the 
search criteria with multiple criteria being per- 
mitted. Besides the standard query and update 
functions that were provided, an additional 
function called "Move" was introduced. This 
command allowed the restructuring of the hier- 
archical data base without going through a 
"Delete" and an "Add." 
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A user's request is typed on the graphic display 
terminal and then transmitted to the Sigma 5. 
The request is passed through an input valida- 
tion software module. Following validation, the 
request is processed by a translation module. 
This translation includes the restructuring of 
the selection criteria according to the rules 
for Reverse Polish Notation. A task list of I/O
functions involving the search criteria is con- 
structed and transmitted via DMA to the STARAN. 
The task list is executed and records that satis- 
fy the search criteria are transmitted back to 
the Sigma 5. Information is extracted from the 
records, formatted, and displayed. 
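
The restructuring of selection criteria into Reverse Polish Notation can be illustrated with a short sketch. The token syntax, the operator names AND/OR/NOT, and the function name `to_rpn` are assumptions made for illustration; the grammar of the actual query language is not given here, and the shunting-yard method shown is one common way to obtain RPN, not necessarily the one used in the translation module.

```python
# Hypothetical sketch: convert a boolean selection criterion to Reverse
# Polish Notation using the shunting-yard method. The token syntax is
# assumed; it is not the query language of the system described above.
PRECEDENCE = {"NOT": 3, "AND": 2, "OR": 1}


def to_rpn(tokens):
    """Return the tokens of an infix boolean expression in RPN order."""
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            # AND/OR are left-associative; NOT is a prefix operator and
            # therefore never pops another NOT of equal precedence.
            while (stack and stack[-1] in PRECEDENCE and
                   (PRECEDENCE[stack[-1]] > PRECEDENCE[tok] or
                    (PRECEDENCE[stack[-1]] == PRECEDENCE[tok]
                     and tok != "NOT"))):
                output.append(stack.pop())
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the "("
        else:
            output.append(tok)  # a simple criterion such as "AGE>30"
    while stack:
        output.append(stack.pop())
    return output


query = ["AGE>30", "AND", "(", "DEPT=472", "OR", "NOT", "GRADE=5", ")"]
rpn = to_rpn(query)
```

In RPN form each criterion can be evaluated as it is met and the operators applied to the results already on a stack, which suits a task list that is executed step by step.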


Our results to date show that associative
processors working in concert with sequential
processors performing in a DBMS environment are an
excellent marriage of two computer concepts.
With multiprocessing capabilities, greater
throughput can be achieved. Timing results show
that for the implemented data base, query and
update times are nearly equal. Our results also
show that a DBMS employing Associative Processors
will require less software. This is due to the
simplicity of the data storage techniques. For
a more detailed description of the Data Management
System implemented on STARAN, the reader

is directed to references [4] and [5].

INTRODUCTION TO THE ARCHITECTURE OF A 288-ELEMENT PEPE

Abstract -- The PEPE (Parallel Element
Processing Ensemble) is a parallel-associative
processor which can attain order-of-magnitude
performance and cost-effectiveness improvements
over conventional machines when employed on
problems containing inherent parallelism. This
paper describes the architectural features of a
new large-scale PEPE system now being constructed
to operate with a CDC 7600 Host.


General Description 


When compared with conventional sequential 
multi-processing computers, PEPE provides much 
faster data processing rates. It does this at 
relatively low cost and with inherent reliability 
since its architecture is made up largely of 
disconnectable, rather simple but triple-process- 
ing element modules which are replicated many 
times throughout the design. Failure in any one 
element affects neither the remaining hardware nor 
the software. 


Each PEPE element may simultaneously respond 
to instruction execution microsteps from each of 
three control units. Therefore, a 288-element 
PEPE may effectively execute up to 864 instruc- 
tions simultaneously. 


Elements may be added to the configuration 
if required, with no effect on the software. The 
capability of associative addressing allows the 
software to be indifferent to the number of 
elements that are present. Individual elements 
are activated or deactivated from participation in 
the execution of an algorithm based upon compari- 
sons of sequential and/or parallel data. 
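
The activation mechanism can be sketched in software. The names `Element`, `select`, `parallel_add` and `active_count` are hypothetical, and the loop is only a sequential stand-in for what the hardware does in all elements at once under control-unit direction.

```python
from dataclasses import dataclass


@dataclass
class Element:
    value: int           # a word held in this element's memory
    active: bool = True  # one-bit activity flag


def select(elements, predicate):
    """Activate exactly those elements whose data satisfies the predicate."""
    for e in elements:
        e.active = predicate(e.value)


def parallel_add(elements, operand):
    """Broadcast one instruction; only active elements respond."""
    for e in elements:  # sequential stand-in for simultaneous hardware
        if e.active:
            e.value += operand


def active_count(elements):
    """Number of elements currently participating."""
    return sum(e.active for e in elements)


ensemble = [Element(v) for v in (5, 12, 7, 20)]
select(ensemble, lambda v: v > 6)   # comparison against sequential data
parallel_add(ensemble, 100)
values = [e.value for e in ensemble]
```

Nothing in the program refers to the length of `ensemble`, which is the sense in which software can be indifferent to the number of elements present.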


PEPE uniquely puts its parallel processing 
capabilities to work by providing completely over- 
lapped input and output functions. The current 
large-scale PEPE model provides architecture to 
interface the parallel processing environment with 
a computing world which is sequentially oriented. 
Input/output conversion units, an input correla- 
tion control unit and an associative output 
control unit are utilized to allow the parallel 
arithmetic architecture to execute virtually with- 
out I/O overhead. Within the correlation and 
associative output control units data are block 
transferred from and to external devices simulta- 
neously with the transfer of other data into or 
out of selected elements. The PEPE then is a 
complete parallel data processing system providing 
an unrestricted throughput relative to its 
parallel arithmetic capabilities (see Figure 6). 
[1] [2] 


Host Interface 


Although the current model PEPE will contain
its own instructions, programs, interrupt
mechanisms, clocks, etc., a close interface with a
standard sequential computing system is desirable for
quickly processing non-array-oriented portions of
problems and for peripheral device control. This
sequential system may also be used for utility
functions such as compiling PEPE programs. For
these purposes, the current PEPE configuration
will utilize the ABMDA Research Center CDC 7600
computer which is connected to PEPE through three
MUX (Input/Output Multiplexor) channels. (a)


PEPE Instructions 


Both sequential control and parallel instruc- 
tions can be intermixed in a program unit. The 
sequential instruction repertoire is required for 
program control functions and includes branching, 
I/O, active element count, and a limited data 
conversion capability (shift, mask, integer 
arithmetic). The parallel instruction repertoire 
includes two types of instructions: those which 
select element activity, and functional instruc- 
tions such as floating point and integer arithme- 
tic, shift and mask. The floating point capabil- 
ity in the Arithmetic Units includes floating 
point - integer conversion instructions and a 
square root instruction (see Figure 1).


Fig. 1. PEPE Instruction Format


(a) 7600 PPUs (Peripheral Processing Units) are not
utilized for this connection because of execution
time penalties.


This work was supported by the U.S. Army Advanced
Ballistic Missile Defense Agency (ABMDA),
Huntsville, Ala., under Contract DAHC60-73-C-0060.


The 32-bit instruction format has an 8-bit op code
field (O), a 1-bit memory unit selection field (M), a
3-bit routing field (R), a 4-bit index register
field (X) (there are 15 index registers in each
control unit), and a 16-bit address field (A).
The routing field determines the source of the
operand and whether the instruction is sequential
or parallel. The instruction "Load A-register,"
for instance, can cause the sequential control
unit A-register or one or more parallel A-registers
to be loaded depending upon the routing field
setting. If a parallel routing is specified,
parallel element A-registers will be loaded only
in "active" elements set by a previous "select"
instruction.
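
The field widths above sum to 32 bits (8 + 1 + 3 + 4 + 16), which the following sketch checks by packing and unpacking an instruction word. The widths are taken from the text; the bit ordering (op code in the most significant bits, address in the least) and the function names `encode` and `decode` are assumptions, since the layout of Figure 1 is not reproduced here.

```python
# Field names and widths as described in the text. The ordering from most
# to least significant bit is an assumption made for this illustration.
FIELDS = (("O", 8), ("M", 1), ("R", 3), ("X", 4), ("A", 16))


def encode(**values):
    """Pack field values into one 32-bit instruction word."""
    word = 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name}={v} does not fit in {width} bits")
        word = (word << width) | v
    return word


def decode(word):
    """Split a 32-bit instruction word back into its fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


total_bits = sum(width for _, width in FIELDS)
word = encode(O=0x2A, M=1, R=5, X=15, A=0x1234)
```

A 4-bit X field can name 16 values; with 15 index registers, one value is presumably left to mean "no indexing", although the text does not say so.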


Fig. 2. PEPE Physical Configuration (figure labels: PEPE Control Console, CDC 7600)


Detailed Description


Physical Configuration


The PEPE design will accommodate 288 processing
elements partitioned into eight element bays.
The element bays are installed radially to reduce
cable length. Current plans call for the
installation of only one element bay containing 36
processing elements. All processing element
operations are controlled from the control console
which also provides the interfaces required for
connection to the CDC 7600 and test and maintenance
equipment. PEPE will be implemented with
standard emitter-coupled logic (ECL) contained in
Dual In-Line Packages (DIPs) which are mounted on
multilayer printed circuit boards. The printed
circuit boards will be approximately 16" x 18" and
will have an average component density of 275 DIPs
in the element bay and a maximum of 150 DIPs in
the control console. PEPE will be cooled by means
of forced air and chilled water. Figure 2
illustrates the PEPE physical configuration.


Control Console. The control console dimen- 
sions are approximately: 


80" high 
50" wide 
26" deep 


Power dissipation is approximately 6000 watts. 


Element Bay. The element bay dimensions are
approximately:


80" high 
78" wide 
26" deep 


Power dissipation is approximately 30,000 watts 
per element bay. 
Fig. 3. (figure labels recovered here: Output Data Control, Output Unit, Element)


Test & Maintenance Equipment. A Burroughs
B1714 computer will be utilized for dynamic test
and maintenance of the PEPE system and its
individual printed circuit boards and processing
elements.


PEPE Processing Element 


Each processing element (PE) contains an 
Arithmetic Unit, Associative Output Unit, 
Correlation Unit and Element Memory as shown in 
Figure 3. The PE contains no instruction execu- 
tion control logic and must receive all timing 
and control signals from the control console. 


The ensemble of processing element units 
receives timing and control signals from corres- 
ponding control console execution units as 
follows: 


PE Unit                    Control Console Unit
Arithmetic Unit            Arithmetic Control Unit
Associative Output Unit    Associative Output Control Unit
Correlation Unit           Correlation Control Unit
Element Memory             Element Memory Control

PEPE Processing Element (figure labels: Control Console with Correlation Control Unit, Element Memory Control and Associative Output Control Unit; Signal Distribution System; Arithmetic Unit, Associative Output Unit, Correlation Unit, Element Memory)
Each PE unit (except element memory) contains
an Activity register (one bit). When a control
unit performs a parallel instruction all
corresponding active (Activity register = "1") PE units
respond so that a maximum of 288 PE units may
simultaneously execute that instruction. Since
the processing element contains three computational
units corresponding to three independent
control units, an ensemble of 288 PEs may be
responding to three simultaneous and independent
parallel instructions thereby effectively executing
864 simultaneous (subject to element memory
conflicts) instructions.


An Activity Stack has been added to the
Arithmetic Unit and Associative Output Unit. It
is a 21-level hardware implemented "push-pop"
stack connected to the Activity register. The
Activity Stack is used to save and restore multiple
subsets of PE units.


All processing element units contain an
8-bit, bit addressed Tag register which is used
to perform associative matches on data received
from the control unit.


Arithmetic Unit. Each Arithmetic Unit (AU)
contains conventional A (accumulator), B (operand)
and Q (quotient or product) registers which
support execution of the parallel integer, logical
and floating point instructions. The AU A-register
is additionally utilized to provide associative
output to its control unit via a data bus
shared with the Associative Output Units. Various
"select" instructions operate upon the AU Activity
register, Activity Stack and Tag register to
determine which Arithmetic Units participate in
subsequent parallel instructions and to remember
and restore previously active element sets.


Associative Output Unit. Each Associative
Output Unit (AOU) contains conventional A and B
registers which support execution of parallel
integer and logical instructions. The AOU
A-register is additionally utilized to provide
associative output to its control unit via a data
bus shared with the Arithmetic Units. Various
"select" instructions operate upon the AOU
registers exactly as in the AUs.


Correlation Unit. Each Correlation Unit (CU)
contains a B-register and 16 Correlation registers
(contained in a 16-word ECL RAM). These registers
support execution of parallel integer and logical
instructions. Correlation register-to-register
operations are permitted. No means are provided
for the CU to output data to its control unit.
Various "select" instructions operate upon the CU
Activity register and Tag register to determine
which Correlation Units participate in subsequent
parallel instructions. There is no Activity Stack
in the CU since the correlation process requires
the rapid identification of elements in which to
store new data, rather than maintenance of a
history of previous sets of activity as in the AUs
and AOUs.


Element Memory. Each element memory (EM)
consists of 1K words of ECL storage and receives
address and mode information from the control
console Element Memory Control (EMC). All
ensemble EMs receive identical information from
EMC during execution of a particular parallel
instruction. EM is connected to the AU, AOU, and
CU by means of a common data bus and consequently
EMC directs the sharing of element memory with
the following priority assignments: (1) CU,
(2) AOU, (3) AU. This priority scheme has been
established since the CU instructions tend to be
short (200-300 nanoseconds) and the AU instructions
tend to be considerably longer (floating
point multiply requires 1.9 microseconds). Program
execution times are expected to increase by
no more than 5% due to element memory conflicts.
Simulation experiments have shown that reversing
the priority order greatly increases program
execution times.


PEPE Control Console


The control console provides instruction
execution control for the entire PEPE. It
contains three control units (see Figure 4) which
are connected to the ensemble of processing
elements as described above. Additionally, the
control console contains functional units which
support the following operations:


° Inter-control unit interrupts
° Error recovery
° Processing element output
° Element memory conflict resolution
° Maintenance and diagnostic tasks
° Input/Output data conversion


The system function of each control unit is:


° ACU - Manipulates the parallel data base
contained in the ensemble of element
memories.

° AOCU - Outputs data resulting from parallel
data base manipulations.

° CCU - Inputs new data.


Control Units. The three control units (ACU,
AOCU, CCU) are of a common design which is
functionally configured as shown in Figure 5.
Each control unit has its own program and data
memory. Communication between control units may
occur via the Intercommunication Logic Unit (ICL)
as illustrated in Figure 4.


Programs are executed from the program memory and
consist of any sequence of:


° sequential instructions to be executed in the
Sequential Control Logic (SCL)

° parallel instructions which are routed through
the SCL to the Parallel Instruction Control
Unit (PICU)


The SCL contains conventional A, B, Q and
index registers which support execution of
sequential integer, logical and branch instructions.
It also responds to parallel instructions
which:


° cause output from the PE (except CCU)
° allow branching based upon element activity
° cause inter-control unit interrupts
° support error recovery
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Sequential instruction operands may be con- 
tained either in the instruction or in program/ 
data memory as specified by the appropriate 
instruction fields. 


Parallel instructions (with indexable 
operands) are routed to the PICU which is a micro- 
programmed execution unit in which the micropro- 
gram memory outputs are utilized to control the 
switching networks in the processing element. 

When required during execution of a parallel
instruction the PICU transmits address, request
and mode data to element memory control. It then
transmits a data strobe to the PE when an
acknowledge is received from EMC indicating that the
PICU has been selected for EM service. Parallel
instruction operands may be contained either in
the instruction or in element memory as specified
by the appropriate instruction fields.


PEPE Control Console Components (figure; surviving label: Element Memory Control)


Because of its large (32K) program memory, 
ACU cycle time is a relatively slow 200-300 ns 
(other program/data memory cycle times are 100ns). 
Moreover, the ACU has responsibility for execution 
of relatively slow parallel floating-point 
instructions, so the ACU parallel instructions are 
routed through a 16-step queue (Parallel Instruction
Queue) prior to execution in the ACU-PICU.
This queue effectively speeds up average ACU 
instruction rate. 
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Input/Output. Each control unit has two 
Input/Output Units (IOU) to provide for control
and fully duplexed data transfer to and from the
CDC 7600 and test and maintenance equipment.
These IOUs are capable of:


° Block transfer of data to control unit
program/data memory initiated by CDC 7600
(T&M equipment)

° Block transfer of data from (to) control
unit program/data memory to (from)
CDC 7600 (T&M equipment) initiated by
sequential instruction execution

° Control unit interrupts

° Control unit start/stop (master clock)


IOU capability has been expanded to allow 
overlap of IOU data transfer with parallel/ 
sequential instruction execution. This feature 
alone is responsible for halving the time it takes 
to correlate new data received by the CCU with 
existing data residing in element memory. 


Element Memory Control (EMC). EMC receives
requests from the three control unit PICUs for
element memory service. It performs any needed
conflict resolution, transmits required control
information to the ensemble EMs and responds to
the PICU when the selected EMs have been properly
switched to service the AU, AOU, or CU.
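
The conflict resolution performed by EMC can be sketched as a fixed-priority arbiter. The priority order (CU, then AOU, then AU) is the one stated for element memory sharing; the function names and the assumption that one request is served per memory cycle are illustrative only.

```python
# Priority order stated for element memory sharing: CU, then AOU, then AU.
PRIORITY = ("CU", "AOU", "AU")


def grant(requests):
    """Return the highest-priority unit among the requesters, or None."""
    for unit in PRIORITY:
        if unit in requests:
            return unit
    return None


def service_order(requests):
    """Order in which simultaneous requests are served.

    Assumes, for illustration, that one request is served per memory cycle.
    """
    pending = set(requests)
    order = []
    while pending:
        unit = grant(pending)
        order.append(unit)
        pending.remove(unit)
    return order


order = service_order({"AU", "CU", "AOU"})
```

Serving the short CU requests first delays the long AU instructions by only a small fraction of their duration, which is consistent with the small slowdown expected from memory conflicts.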


Output Data Control (ODC). ODC receives 
requests from the ACU/AOCU to transfer AU/AOU 
A-register contents to the A-register in the ACU/ 
AOCU SCL. It performs conflict resolution and 
places the active AU/AOU A-register contents on a 
common data bus to the control console. ODC then 
transmits an acknowledge to the ACU/AOCU SCL to 
achieve the data transfer. More than one AU/AOU 
in an active state will cause an error condition 
to be processed by the Inter Communication Logic. 


Inter Communication Logic (ICL). ICL provides
the mechanism for:


° AOCU interrupt of the ACU
° CCU interrupt of the ACU
° Control unit interrupts from IOU
° ACU control of control unit registers
° Error interrupts from ACU
° Real-Time Clock
° Interval Timer
° System data collection
Neither the AOCU nor the CCU has floating
point instructions. Therefore, they have been
given the capability to interrupt the ACU in order 
to execute subroutines which require floating 
point manipulations. The ICL prevents interrupt 
"nesting" by either AOCU or CCU, and contains four 
registers (two each for the CCU and the AOCU) 
which may be utilized for inter-control-unit 
interrupt data transfer. Provision has been made 
for the inclusion of a 1K-word ICL memory in the 
event that extensive inter-control unit communica- 
tion becomes necessary. 
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Each Input/Output Unit transmits control unit 
interrupt requests to the ICL. Three registers 
(one for each control unit) provide the means of 
transmitting an interrupt message to the control 
units with each interrupt request from either the 
CDC 7600 or the test and maintenance equipment. 


Error conditions within the PEPE signal the 
ICL to generate an error interrupt to the ACU. 
An error identification code is placed in an ICL 
register. 


The ACU SCL error-recovery software utilizes 
supervisory instructions which can read and write 
all control unit registers to and from the 
A-register in the ACU SCL. 


A Real-Time Clock (46 bits) and Interval 
Timer (24 bits) are contained in the ICL. Both 
count with 100ns granularity and are fully 
accessible from all control units. An ACU 
interrupt may be generated when: 


° The Interval Timer decrements to zero 

° The Real-Time Clock equals the value con- 
tained in the Real-Time Clock Buffer (fully 
accessible from all control units) 
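
The counter widths and the 100 ns granularity fix how long each counter can run before it overflows. The short calculation below works this out; the function name `wrap_seconds` is hypothetical, and the arithmetic assumes each counter steps once per 100 ns tick.

```python
TICK_SECONDS = 100e-9  # 100 ns granularity, from the text


def wrap_seconds(bits):
    """Seconds until a counter of the given width overflows."""
    return (2 ** bits) * TICK_SECONDS


real_time_clock_days = wrap_seconds(46) / 86_400   # 46-bit Real-Time Clock
interval_timer_seconds = wrap_seconds(24)          # 24-bit Interval Timer
```

Under these assumptions the 46-bit Real-Time Clock spans roughly 81 days, while the 24-bit Interval Timer covers intervals of up to about 1.68 seconds.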


Eight counters (24 bits) are available in 
the ICL for monitoring software/hardware perform- 
ance. 


Maintenance Control and Diagnostic Unit
(MCDU). The MCDU is the diagnostic interface
which couples the test and maintenance equipment,
the PEPE control console, and the maintenance
technician.


Future Development 


Although the PEPE Program is continuing under 
the direction of ABMDA for the purpose of develop- 
ing an advanced ballistic missile defense system, 
nonmilitary applications for PEPE have been 
studied with the permission of ABMDA. These 
applications could include air traffic control, 
satellite tracking, auto traffic control and 
weather data processing. 
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OPERATING SYSTEM AND SUPPORT SOFTWARE FOR PEPE 


J.R. Dingeldine, H.G. Martin, W.M. Patterson 
Huntsville Operations 
System Development Corporation 
Huntsville, Alabama 35805 


Abstract -~ Software for the CDC 7600-PEPE 
configuration consists of a constructable real- 
time tactical process and the support software 
required to develop and execute the real-time 
process. This paper discusses: 1) the develop- 
ment and the real-time characteristics of the 
Operating System; 2) the procedure oriented lan- 
guage, Parallel FORTRAN (PFOR), used to develop 
tactical programs; and 3) the PFOR Translation 
System. A PEPE instruction level simulator and 
the process constructor are covered by other 


Operating System Software 


Figure 1 is a simplified picture of the PEPE 
system and its host. Three bi-directional 
communications paths connect the host computer 
(CDC 7600) with each of the PEPE controllers. The 
system is a network of controllers each of which 
requires compatible parts of the operating system. 
These interfacing parts, together with real-time 
executive functions, are the subject of this 
section. Comments are made on design goals, real- 


time executive functions, process execution 
control tables, and system performance under 
functional simulations. 


papers in this set. [3] [4] 
This work was supported by the U.S. Army Advanced
Ballistic Missile Defense Agency (ABMDA),
Huntsville, Ala., under Contract DAHC60-73-C-0060.
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Fig. 2. Real-Time Executive Design Goals 


Following experience with the previous PEPE 
feasibility model and study of other existing and 
proposed multi-computer operating systems, a pre- 
liminary operating-system model was designed with 
the goals listed in Figure 2 in mind. A func- 
tional simulation model was generated to aid in 
evaluating the system's effectiveness. The 
results from the runs revealed excessive executive 
and interrupt handling times. Response to 
external stimuli was poor. The basic design was 
flexible enough and table driven as required, but 
the interrupt and overhead tasks were time 
consuming. 


A second simulation model was generated with 
emphasis on simplicity in the hope that flexibil- 
ity and responsiveness would follow. The current 
system is an outgrowth of the second model. 

Figure 3 is a flow chart of the real-time execu- 
tive loop. It has only three steps: (1) If there 
has been a change in the status of any condition, 
make any indicated enablements using process con- 
trol tables; (2) Select the highest priority task; 
(3) If a task is selected, clear the software 
interrupt flag, and call the task. When the task 
is completed, it returns control to the executive 
and the cycle is repeated. This basic executive 
cycle is supported by interrupt handlers and out- 
put routines which accomplish process control 
table changes when messages requiring actions are 
intercepted. 
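
The three-step cycle can be written out as a short sketch. The data layout (a dictionary with a priority-ordered list of enable flags), the names `executive_loop` and `apply_enablements`, and the choice to clear a task's flag when it is dispatched are all assumptions; the sketch is not the CDC 7600 executive code.

```python
def executive_loop(state, tasks, apply_enablements, max_cycles=10):
    """Run the three-step executive cycle and return the tasks dispatched.

    state["task_enable_flags"] is a list of booleans in priority order
    (index 0 is the highest priority); tasks is a parallel list of callables.
    """
    dispatched = []
    for _ in range(max_cycles):
        # Step 1: on a change of condition status, make indicated enablements.
        if state["conditions_changed"]:
            apply_enablements(state)
            state["conditions_changed"] = False
        # Step 2: select the highest-priority enabled task.
        flags = state["task_enable_flags"]
        selected = next((i for i, on in enumerate(flags) if on), None)
        if selected is None:
            continue  # nothing enabled; idle cycle
        # Step 3: clear the software interrupt flag and call the task.
        state["software_interrupt"] = False
        flags[selected] = False  # assumed: dispatch consumes the enablement
        tasks[selected](state)
        dispatched.append(selected)
    return dispatched


ran = []
state = {"conditions_changed": True,
         "task_enable_flags": [False, False, True],
         "software_interrupt": True}
tasks = [lambda s: ran.append("t0"),
         lambda s: ran.append("t1"),
         lambda s: ran.append("t2")]


def enable_task_1(s):
    s["task_enable_flags"][1] = True


order = executive_loop(state, tasks, enable_task_1, max_cycles=4)
```

Each task returns to the loop when it finishes, so a newly enabled higher-priority task is picked up at the next pass through step 2.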


Timing tests on CDC 7600 code for the executive
produced favorable results. Conditional
enablements, step (1), were made in 4.620 microseconds
using a three entry table. Task selection,
step (2), required 4.950 microseconds, while task
initiation, step (3), used 4.263 microseconds.
So, a task may be running in the host within 10
microseconds after it is enabled. This quick
response is due to the simple structure of the
process control tables, Figure 4.
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Fig. 3. 
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Fig. 4. PEPE Process Control Tables 


The Task Enable Flags are arranged in prior- 
ity order. The first flag found in the On State 
represents the next task to run. The Software 
Interrupt Flag is set any time a task is enabled 
which has a higher priority than a running task. 
Any task which runs longer than the software 
interrupt interval (say 250 microseconds) must 
enable itself, adjust controls for continuing its 
operation later, and return control to the execu- 
tive. The status of conditions is maintained by 
the running tasks. The Conditions for Enablement 
Table has three parts per entry: a set of condition
states, a mask for selecting the set of
conditions, and task enabling flags (tasks are 
identified by flag positions as in the Task Enable 
Flags Table). All entries in the table are 
processed when a change in the status of condi- 
tions is detected. The Time Events Table is a 
chronological list of scheduled time events with 
associated periods between events and task 
enabling flags. The Task Description Table 
identifies the task name, size, and location in 
task priority (number) order. A buffer pointer 
leads the executing task to its input data. 
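
One way to read the Conditions for Enablement Table is as a list of (states, mask, flags) triples tested against the current condition status. The sketch below follows that reading. The matching rule (an entry fires when the masked status equals the masked states), the use of integer bit fields, and the convention that bit 0 is the highest priority are assumptions made for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    states: int  # required values of the condition bits
    mask: int    # which condition bits this entry examines
    flags: int   # tasks to enable, one bit per task


def apply_conditions(status, table, task_enable_flags):
    """Process every table entry and return the updated enable flags."""
    for entry in table:
        if (status & entry.mask) == (entry.states & entry.mask):
            task_enable_flags |= entry.flags
    return task_enable_flags


def next_task(task_enable_flags, n_tasks):
    """Index of the first flag in the On state, or None if none is set."""
    for i in range(n_tasks):
        if task_enable_flags & (1 << i):
            return i
    return None


table = [
    Entry(states=0b011, mask=0b011, flags=0b0100),  # conditions 0 and 1 set
    Entry(states=0b000, mask=0b100, flags=0b0001),  # condition 2 clear
]
flags_a = apply_conditions(0b011, table, 0)
flags_b = apply_conditions(0b111, table, 0)
```

Because a single entry carries a whole word of task flags, one condition match can enable many tasks at once, and task selection reduces to finding the first set bit.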


The process control tables are accessible to 
the running task, the executive, and all interrupt 
handlers. Tasks are triggered in other control- 
lers by messages in standard form. Time Event 
Change and Task Enable messages are completely 
processed by the message handlers. 


The simplicity of the real-time control 
process permits similar control mechanisms in all 
system controllers (see Figure 1). The capabili- 
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ties of the controllers dictate the amount of 
local executive control. The PEPE Arithmetic 
Controller has full interrupt features with an 
interval timer and error interrupts. It, there- 
fore, has full real-time executive controls. The 
other two PEPE controllers have only input 
interrupt, task initiation controls, and output 
handler features. 


It is interesting to note that with the rede- 
sign of PEPE to include program storage, the total 
system executes as a sequential computer network. 
The only operational difference is shorter 
execution times. Thus, changes to the host's 
commercial operating system are required only to 
support the input/output channels and for the 
addition of a real-time interval timer. 


The real-time controls as described accom- 
plish the original design goals (Figure 2) favor- 
ably. The simplicity is illustrated by the 
executive (Figure 3) itself. Responsiveness 
results from the simple requirement of the inter- 
rupt handlers. The time controls are maintained 
in time order to eliminate time consuming 
searches or sorts. Up to 48 tasks may be enabled 
by one condition table entry. The task triggering 
methods together with the rapid response to 
enablements permit efficient calls to scheduling
algorithms or deadline functions. For example, a 
deadline task may be time enabled when the dead- 
lined task is scheduled. If the task executes 
before the deadline, it simply deletes the time 
table entry which would trigger the deadline 
action. The table structure obviously permits 
many process construction forms such as enable 
tasks, set time event, set/reset conditions, etc. 
Simulation model testing and actual instruction 
timings conducted at the ABMDA Research Center in 
Huntsville substantiate these statements. 
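
The deadline example can be sketched with a time-ordered event list. The class `TimeEvents`, its method names, and the task names are hypothetical; the only properties taken from the text are that entries are kept in time order and that a task finishing before its deadline deletes the entry that would trigger the deadline action.

```python
import bisect


class TimeEvents:
    """A time-ordered list of (time, task) entries."""

    def __init__(self):
        self.events = []

    def schedule(self, time, task):
        # Insert in time order so no search or sort is needed later.
        bisect.insort(self.events, (time, task))

    def cancel(self, task):
        """Delete the entries that would trigger the named task."""
        self.events = [e for e in self.events if e[1] != task]

    def due(self, now):
        """Remove and return tasks whose time has arrived, earliest first."""
        fired = []
        while self.events and self.events[0][0] <= now:
            fired.append(self.events.pop(0)[1])
        return fired


table = TimeEvents()
table.schedule(500, "deadline_action")  # time-enabled when the task is scheduled
table.schedule(200, "track_update")     # the deadlined task itself
first = table.due(250)                  # the task runs before its deadline
table.cancel("deadline_action")         # so it deletes the deadline entry
later = table.due(1000)
```

Only the head of the list has to be examined on each clock check, which is the benefit of keeping the entries in time order.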


The simplicity of the real-time controls 
coupled with the ease of operating with the redesigned
PEPE appears to have produced an efficient
and effective real-time system. 


Support System Software - Parallel FORTRAN (PFOR) 


Overview 


The Parallel FORTRAN (PFOR) sections of this 
paper emphasize language extensions and changes to 
the PFOR Translation System developed since the 
presentation of papers on the Parallel Element 
Processing Ensemble (PEPE) and its support soft- 
ware at COMPCON 72 and WESCON 72 [1], [2]. The 
referenced papers describe the basic PFOR lan- 
guage and language processors as implemented on 
the laboratory PEPE IC (Integrated Circuit) model 
used to demonstrate the feasibility of PEPE in a 
Ballistic Missile Defense (BMD) environment. 


PFOR Language 


PFOR is a procedure oriented, higher order 
FORTRAN-like language tailored to the new PEPE MSI
(Medium Scale Integration) model. The language 
consists of: 1) PEPE FORTRAN, the minimal subset 
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of standard FORTRAN required for the sequential 
control of parallel algorithms in the PEPE Sequen- 
tial Control Logic (SCL) hardware, 2) Parallel 
FORTRAN or PFOR, the extensions to FORTRAN for the 
declaration of data and the parallel and associa- 
tive processing of data in the PEPE elements, and 
3) PEPE Assembly Language (PAL) machine instructions,
extended mnemonics, and pseudo operations.


PFOR is currently being used to develop PEPE 
tactical processes. It is the sole source lan- 
guage for the three PEPE control units with unique 
machine code generated by the compiler for each 
control unit; whereas for the PEPE IC model, PFOR 
was used to program only the Arithmetic Control 
Unit (ACU). A macro assembly language (CUAL) was 
used to program the Correlation Control Unit (CCU) 
and sequential control was exercised in the host 
IBM S/360-65. The IC model hardware did not 
contain an Associative Output Control Unit (AOCU). 
The capability of intermixing PFOR, FORTRAN, and 
PAL statements in a source program has been 
retained. For the MSI model the PAL assembly lan- 
guage statements are bracketed by the PFOR 
primitives MODE(DIRECT) and MODE(PFOR). Each 
block of PAL code is processed as a single PFOR 
source statement. 


PEPE FORTRAN 


The FORTRAN declarative statements, impera- 
tive statements, and logical, relational, and 
arithmetic operators defined for sequential 
execution in PEPE are listed in Figures 5 and 6. 


The arithmetic operators * and / are not 
defined since the sequential portion of the PEPE 
hardware supports only 24-bit integer addition 
and subtraction. Address (16-bit) multiplication
is, however, implemented in the sequential hardware,
which allows the compiler to generate efficient code
for array references which contain variables in
the array subscript.
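
As an illustration of why an address multiply helps with subscripted references, the sketch below computes the address of a FORTRAN array element from variable subscripts. It is written in Python rather than PFOR, and it assumes standard FORTRAN column-major storage, one word per array element, and a hypothetical base address. The function name and the 16-bit wraparound mask are illustrative and are not taken from the PEPE hardware description.

```python
ADDRESS_BITS = 16                      # address width stated in the text
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def element_address(base, dims, subscripts):
    """Return the address of A(i1, i2, ...) for 1-based FORTRAN subscripts.

    Assumes column-major storage with one word per element (an assumption
    of this sketch, not a documented PEPE layout).
    """
    if len(dims) != len(subscripts):
        raise ValueError("one subscript is required per dimension")
    offset = 0
    stride = 1
    for dim, sub in zip(dims, subscripts):
        if not 1 <= sub <= dim:
            raise IndexError("subscript out of range")
        # This product is the address multiplication; without it the
        # offset would have to be built up by repeated addition.
        offset += (sub - 1) * stride
        stride *= dim
    return (base + offset) & ADDRESS_MASK
```

For a 3 by 4 array based at address 100, `element_address(100, (3, 4), (3, 4))` gives 111, since the last element lies 2 + 3*3 = 11 words past the base.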


A minimal subset of standard FORTRAN required 
to exercise sequential control of parallel, 
tactical processes has been defined for PEPE. 


FORTRAN REAL variables are limited in use 
since the SCL hardware does not perform floating 
point operations. These variables are used dur- 
ing data transfer between the host and the 
ensemble. 


PFOR Extensions 


The PFOR language has previously been 
described by Wilson [1] and Cornell [2] at 
COMPCON 72 and WESCON 72. The basic PEPE IC 
model PFOR primitives and operators which have 
been retained for implementation on the MSI model 
are listed in Figure 7. 


Several features have been added to the PFOR
language for the PEPE MSI model. To support
parallel double precision integer arithmetic, the
data description forms PAR DOUBLE (element memory
double word) and PAR COR DOUBLE (correlation
register file double word) have been added. For
example,

     PAR DOUBLE PV

declares PV to be a double precision, integer
parallel variable. In all examples in this paper,
names prefixed by the letter P denote parallel
variables.


[Fig. 5. PEPE FORTRAN Subset Declaratives. Figure content not recoverable from the source.]


     CALL
     CONTINUE
     DO
     GOTO (UNCONDITIONAL, COMPUTED)
     IF (LOGICAL, ARITHMETIC)
     RETURN
     STOP
     .AND., .OR., .NOT.

Fig. 6. PEPE FORTRAN Subset Imperatives and Operators


[Fig. 7. Basic PFOR Primitives and Operators. Figure content not recoverable from the source.]


The set of PFOR statements which allow order- 
ed selection of PEPE elements in sequential, 
ascending, and descending fashion (DO SEQ, DO ASC, 
and DO DESC) select the elements one at a time. 
These have been augmented by the DO UP and DO DOWN 
constructs which allow sets of elements to be 
utilized in an ascending or descending manner. 

The code sequence 


     WHERE (PTEST) 100
     DO UP 100 (PV) 10,I
 100 PW = I


acts like a DO ASC statement if, in the elements 
passing the test (the set of elements remaining 
active by virtue of the parallel logical variable 
PTEST being true), the ascending algebraic values 
of PV are unique. In this case, in the active
element where PV has the lowest algebraic value,
PW is set to one; in the active element where PV
has the next lowest value, PW is set equal to two,
etc. However, if PV is duplicated in one or more
of the elements passing the test, a "tie" exists.
Assume, for example, that the two lowest algebraic
values of PV are identical in the active set of
elements. The DO ASC construct causes PW to be set
equal to one in the first physically available
active element of the two. PW is set to two in the
other element, and then looping continues with
PW = 3, PW = 4, etc., until at most ten elements
are looped over. Upon completion of the DO ASC loop,
I is set to indicate the number of elements
involved in the processing (i.e., elements where
PW has been set), for there may be fewer than ten
elements which passed the test.


For the DO UP construct, PW is set to one for 
both elements where the lowest value of PV is 
identical; then PW is set to two in the active 
element where PV has the next lowest value, etc.; 
until at most ten sets of elements are looped 
over. Upon completion of the loop, I is set to 
indicate the number of sets of elements involved 
in the processing (sequentially tagged in the 
example), for there may be fewer than ten sets of
elements which passed the test.
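
The difference in tie handling between DO ASC and DO UP can be sketched as follows. This is a Python model of the behaviour described above, not PFOR and not the compiler's generated code. It assumes that "first physically available" means the lowest element index, and the function names `do_asc` and `do_up` are hypothetical. Each function returns the PW values assigned (keyed by element index) and the final value of the loop count I.

```python
def do_asc(pv, active, limit):
    """Visit active elements one at a time in ascending PV order.

    Ties are broken by element index (assumed meaning of
    'first physically available').
    """
    order = sorted((i for i, a in enumerate(active) if a),
                   key=lambda i: (pv[i], i))
    pw = {}
    for count, idx in enumerate(order[:limit], start=1):
        pw[idx] = count
    return pw, len(pw)


def do_up(pv, active, limit):
    """Visit sets of active elements sharing a PV value, in ascending order."""
    values = sorted({pv[i] for i, a in enumerate(active) if a})[:limit]
    pw = {}
    for count, value in enumerate(values, start=1):
        for i, a in enumerate(active):
            if a and pv[i] == value:
                pw[i] = count      # every tied element receives the same tag
    return pw, len(values)
```

With `pv = [5, 3, 3, 9, 1]` and the last element inactive, `do_asc` tags elements 1 and 2 with 1 and 2 and finishes with I = 4, while `do_up` tags both tied elements with 1 and finishes with I = 3.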


The WHERE class of statements (WHERE, WHERE
MAX, WHERE MIN), which is used to specify a
content-addressed subset of PEPE elements, has
been augmented by the addition of the WHERE FIRST,
WHERE NOT, WHERE SET, and CONVERGE constructs.
The general form 

WHERE (logical attribute) s# 

causes the subset of elements satisfying the 
given attribute to remain active and to partici- 
pate in processing through the range of the 
statement labeled s#. 


The WHERE class of statements illustrates 
the explicit associative aspects and the implicit 
parallel aspects of the PFOR language. The 
attribute parameter explicitly denotes the asso- 
ciative or content based addressing. The 
parallel execution of the statements in the range 
of the WHERE statement (to and including the 
statement labeled s#) is implicit. 


The statements in the range of a WHERE FIRST 
statement (through s#) are executed only in the 
first physically available element of the active 
set. No attribute is specified. For example, 
suppose we wanted to calculate (PZ)^2 in one and
only one element where PX is greater than PY.
This could be realized by the program sequence

     WHERE (PX .GT. PY) 10
     WHERE FIRST 10
  10 PZ = PZ * PZ

which calculates a new value of PZ in the first
physically available element of the subset of
elements which pass the test.


A WHERE NOT statement must reside within the
range of a WHERE, WHERE MAX, or WHERE MIN state- 
ment and does not require specification of an 
attribute. The statements in the range of a 
WHERE NOT statement (through s#) are executed only 
in those elements made inactive by the preceding 
WHERE statement. Upon execution of the range 
terminating statement (labeled s#), the element 
activity reverts to the set of elements active by 
virtue of the preceding WHERE-type statement. 
Assuming a 288-element ensemble where 100 elements 
are active at the time the simple WHERE statement 
is executed and 75 elements of the 100 pass the 
test in the simple WHERE statement, the program 
sequence 


     WHERE (PX .GT. PY) 10
     PFLAG = 1
     WHERE NOT 5
   5 PZ = PZ + 1
  10 PZ = PZ * PZ


sets PFLAG equal to one and PZ = (PZ)^2 in the set
of 75 active elements where PX is greater than PY
and sets PZ equal to PZ plus one in the set of 25
elements made inactive because PX is not greater
than PY.
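
The element counts in this example can be checked with a small model of the activity masks. The sketch below is Python, not PFOR; the helper names (`where`, `where_not`, `where_first`, `run_where_not_example`) and the test data are assumptions chosen to reproduce the 288/100/75/25 split described in the text, and "first physically available" is again taken to mean the lowest element index.

```python
def where(active, test):
    """WHERE: keep the active elements that satisfy the attribute."""
    return [a and t for a, t in zip(active, test)]


def where_not(outer, inner):
    """WHERE NOT: elements the preceding WHERE made inactive."""
    return [o and not i for o, i in zip(outer, inner)]


def where_first(active):
    """WHERE FIRST: only the lowest-indexed active element stays active."""
    result = [False] * len(active)
    for i, a in enumerate(active):
        if a:
            result[i] = True
            break
    return result


def run_where_not_example():
    n = 288
    active = [i < 100 for i in range(n)]       # 100 elements active on entry
    px = [1 if i < 75 else 0 for i in range(n)]
    py = [0 if i < 75 else 1 for i in range(n)]
    pz = [3] * n
    pflag = [0] * n

    inner = where(active, [x > y for x, y in zip(px, py)])
    for i in range(n):
        if inner[i]:
            pflag[i] = 1                       # PFLAG = 1
    others = where_not(active, inner)
    for i in range(n):
        if others[i]:
            pz[i] = pz[i] + 1                  # statement 5
    for i in range(n):
        if inner[i]:
            pz[i] = pz[i] * pz[i]              # statement 10
    return inner, others, pz, pflag
```

Elements that were inactive before the WHERE (indices 100 and above in this data) are untouched by either branch, which is the distinction between WHERE NOT and a simple logical negation of the attribute.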


The WHERE SET construct allows the user to 
temporarily activate a set of elements. This may 
expand (or contract) the set of active elements 
as opposed to the typical nesting of WHERE state- 
ments which subset elements into smaller sets and 
reinstate the element activity level by level as 
the program reverts from inner levels to outer 
levels. Assuming a 288-element ensemble where 
100 elements are active, 75 remain active by 
virtue of passing the test in the simple WHERE 
statement, and PW is greater than zero in 250 of 
the 288 elements in the ensemble, the program 
sequence 


     WHERE (PX .GT. PY) 10
     PZ = PZ * PZ + 1
     WHERE SET (PW .GT. 0) 5
   5 PX = 0
     PY = 1
  10 CONTINUE


computes a new value of PZ in the 75 elements of
the set of 100 where PX is algebraically greater
than PY, sets PX equal to zero in all elements of
the ensemble where PW is greater than zero, sets 
PY equal to one in the 75 elements of the set of 
100, and then reverts the ensemble activity back 
so that the original 100 are active. The normal 
nesting of WHERE statements allows subsetting of 
element activity and, when the range terminator 
statement labeled s# has been executed, the 
previously active set of elements becomes 
reactivated. 
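
The contrast between ordinary WHERE nesting and WHERE SET can be modelled as a stack of activity masks. The following Python sketch is an illustration under stated assumptions, not a description of the PEPE hardware: the class name `Activity`, its methods, and the choice of which element indices pass each test are all hypothetical. It reproduces the counts of the example above (100 active, 75 after the WHERE, 250 inside the WHERE SET range, then 75 and 100 again).

```python
class Activity:
    """Stack of element-activity masks, one entry per open range."""

    def __init__(self, initial):
        self._stack = [list(initial)]

    @property
    def current(self):
        return self._stack[-1]

    def where(self, test):
        """Nested WHERE: subset the currently active elements."""
        self._stack.append([a and t for a, t in zip(self.current, test)])

    def where_set(self, test):
        """WHERE SET: activate every element passing the test, ensemble-wide."""
        self._stack.append(list(test))

    def end_range(self):
        """Range terminator: reinstate the previously active set."""
        if len(self._stack) == 1:
            raise RuntimeError("no open range to terminate")
        self._stack.pop()


def where_set_example():
    n = 288
    act = Activity([i < 100 for i in range(n)])
    px_gt_py = [i < 75 for i in range(n)]      # 75 of the 100 pass
    pw_gt_0 = [i >= 38 for i in range(n)]      # 250 of the 288 pass
    counts = [sum(act.current)]
    act.where(px_gt_py)
    counts.append(sum(act.current))
    act.where_set(pw_gt_0)
    counts.append(sum(act.current))
    act.end_range()                            # statement 5
    counts.append(sum(act.current))
    act.end_range()                            # statement 10
    counts.append(sum(act.current))
    return counts
```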


A simpler, faster-executing content-address- 
able method of subsetting element activity has 
been implemented for CCU-targeted program units in 
the form of the CONVERGE construct. The CONVERGE
statement is used for typical correlation
algorithms which involve short iterations of code.
The general form of the CONVERGE statement is

     CONVERGE (logical attribute) s#

CONVERGE statements may be nested, but the range
terminator statement labeled s# must be identical
for each CONVERGE statement in the nested set.
The element activity reverts to its previous 
state once for each set of nested CONVERGE state- 
ments; whereas, for WHERE-type statements the 
element activity is restored following the range 
terminator statement of each WHERE in the nested 
set. A set of nested WHERE statements may also 
have the same range terminator statement label 
but, for each nested WHERE, code must be 
generated to revert the element activity step by 
step from inner level to outer level. The 
CONVERGE statement executes faster only when 
nested with other CONVERGE statements. The 
element subsetting of CONVERGE and WHERE state- 
ments can be illustrated by the following code 
sequences where the numbers in parentheses 
indicate the number of active elements: 


     (100)                            (100)
     CONVERGE(PY .GT. PX) 15          WHERE(PY .GT. PX) 15
     (50)                             (50)
     CONVERGE(PM .LE. PN) 15          WHERE(PM .LE. PN) 10
     (30)                             (30)
  15 CONTINUE                      10 CONTINUE
     (100)                            (50)
                                   15 CONTINUE
                                      (100)
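
The two columns above can be traced with a short Python model. It only mirrors the visible behaviour described in the text (one restoration per nested CONVERGE set, one restoration per WHERE level); how the CCU actually saves and restores activity is not stated in the source, so the `Ensemble` class, its fields, and the sample masks are assumptions made for the sketch.

```python
class Ensemble:
    """Tracks the active-element mask and how often it is restored."""

    def __init__(self, active):
        self.active = list(active)
        self._where_stack = []
        self._converge_saved = None
        self.restores = 0

    def where(self, test):
        self._where_stack.append(self.active)       # one save per level
        self.active = [a and t for a, t in zip(self.active, test)]

    def end_where(self):
        self.active = self._where_stack.pop()
        self.restores += 1

    def converge(self, test):
        if self._converge_saved is None:            # save once per nest
            self._converge_saved = self.active
        self.active = [a and t for a, t in zip(self.active, test)]

    def end_converge(self):
        self.active = self._converge_saved
        self._converge_saved = None
        self.restores += 1


def trace(kind):
    n = 100
    py_gt_px = [i < 50 for i in range(n)]
    pm_le_pn = [i < 30 for i in range(n)]
    e = Ensemble([True] * n)
    counts = [sum(e.active)]
    if kind == "converge":
        e.converge(py_gt_px)
        counts.append(sum(e.active))
        e.converge(pm_le_pn)
        counts.append(sum(e.active))
        e.end_converge()                            # shared terminator 15
        counts.append(sum(e.active))
    else:
        e.where(py_gt_px)
        counts.append(sum(e.active))
        e.where(pm_le_pn)
        counts.append(sum(e.active))
        e.end_where()                               # terminator 10
        counts.append(sum(e.active))
        e.end_where()                               # terminator 15
        counts.append(sum(e.active))
    return counts, e.restores
```

In this model the saving from CONVERGE comes entirely from skipping the intermediate restorations, which is consistent with the statement that CONVERGE is faster only when nested with other CONVERGE statements.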


The INHIBIT INTERRUPT and ALLOW INTERRUPT 
constructs are used to bracket short, fast-execut- 
ing code sequences in ACU-targeted program units 
to inhibit Host, CCU, and AOCU interrupts. This 
feature has been implemented because PFOR programs 
are not re-entrant. 


The MODE (processor) and MODE OFF (processor) 
constructs are used to bracket subordinate code 
sequences targeted for execution in another 
processor. The WHILE construct is used to specify 
a primary code sequence which is to continue 
execution in an overlap mode while the subordinate 
code sequence executes in another processor. For
example, in a CCU-targeted program unit, the
program sequence


MODE (ACU) 
PV = PX + PY 
MODE OFF (ACU) 


causes the ACU to be interrupted, the activity
state of the CUs (Correlation Units) to be
transferred to the AUs (Arithmetic Units), and the
bracketed code sequence to be executed by the
ACU under the control of the ACU resident
Real-Time Executive Interrupt Handler. Also, in a
CCU-targeted source program, the program sequence


     MODE (ACU)
     PV = PX + PY
     WHILE
     CALL CORLAT
     MODE OFF (ACU)


causes the ACU to be interrupted and the code
sequence bracketed by MODE (ACU) and WHILE to be
executed in the ACU. In an overlap fashion, the CCU
resident subroutine CORLAT is called and executed.
The statement following the MODE OFF (ACU)
statement is executed only after both the ACU code
sequence (bracketed by the MODE (ACU) and WHILE
statements) and the CCU code sequence (bracketed
by the WHILE and MODE OFF (ACU) statements) have
been executed to completion.
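
The MODE / WHILE / MODE OFF bracket behaves like a fork followed by a join. As a loose analogy only, the Python sketch below runs a subordinate sequence on a worker thread while the primary sequence continues, and proceeds past the bracket only when both have finished. Threads stand in for the separate control units; the function name `run_overlapped` and the sample workloads are hypothetical, and nothing here reflects how the PEPE interrupt mechanism is implemented.

```python
import threading


def run_overlapped(subordinate, primary):
    """Run `subordinate` in another 'processor' while `primary` continues."""
    results = {}

    def run_subordinate():
        results["subordinate"] = subordinate()

    worker = threading.Thread(target=run_subordinate)
    worker.start()                       # MODE (ACU): hand off the bracketed code
    results["primary"] = primary()       # WHILE: primary sequence overlaps
    worker.join()                        # MODE OFF (ACU): wait for both to finish
    return results


def example():
    px = [1, 2, 3]
    py = [10, 20, 30]
    return run_overlapped(
        subordinate=lambda: [x + y for x, y in zip(px, py)],   # PV = PX + PY
        primary=lambda: "CORLAT done",                          # CALL CORLAT
    )
```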


The function PABS is used to obtain the abso- 
lute value of a parallel expression. The PEPESTAT 
construct allows the programmer to transfer the 
status of element activity from one control unit 
to another via the shared element memory and 
allows an ACU or AOCU targeted program unit to 
extend element subsetting beyond the normal limit 
of 21 levels. 


The READ, WRITE, and WRITE-with-End-of-Record
(WRITER) constructs are used to pass data between
the CDC 7600 host and the PEPE SCL data memory.
These constructs are PEPE oriented in that 1) no
FORTRAN FORMAT capability is required, 2) data
are converted by the hardware from CDC format to
PEPE format or vice versa as specified in a PEPE
resident control word referenced by the PEPE
I/O machine instructions, and 3) data conversions
are performed based on the PFOR type specification
statements for the variables. Figure 8 lists the
extensions to PFOR being implemented for the PEPE
MSI model.


     PAR DOUBLE
     PAR COR DOUBLE

     DO     DOWN     (ARITHMETIC ATTRIBUTE)
            UP       (ARITHMETIC ATTRIBUTE)

     WHERE  FIRST    NO ATTRIBUTE REQUIRED
            NOT      NO ATTRIBUTE REQUIRED
            SET      (LOGICAL ATTRIBUTE)

     CONVERGE (LOGICAL ATTRIBUTE)

     INHIBIT INTERRUPT
     ALLOW INTERRUPT

     MODE DIRECT, MODE PFOR

     READ, WRITE, WRITER

Fig. 8. PFOR Extensions for PEPE MSI Model


Additional features incorporated in the PEPE
MSI model PFOR language allow ACU and AOCU 
targeted subroutines containing parallel state- 
ments to be nested to any level. Also, PFOR 
statements may be extended as continuation lines 
in the FORTRAN fashion. 


The advantages of the PFOR language are that 
it is easy to learn (like FORTRAN), the associa- 
tive aspects are explicit, the parallel aspects 
are implicit, and it is used as a common source 
language for the three PEPE processors. 


Minor constraints include certain limitations 
on usage; i.e., some language forms cannot be 
utilized in all three processors because each 
processor has unique hardware designed for its 
particular application (input, processing, output). 
Moreover, the utilization of mixed parallel and 
sequential forms in a single source program is 
sometimes a source of confusion to programmers 
oriented towards sequential processing computer 
systems.


PFOR Translation System


Background


The PFOR language translation system for the 
laboratory PEPE IC model resided on the IBM
S/360-65. It consisted of 1) PFOR Monitor,
2) PFOR precompiler, 3) PAL assembler, 4) S/360
FORTRAN compiler, and 5) S/360 assembler. The
PFOR precompiler was a preprocessor which con- 
verted the PFOR source language to FORTRAN for 
execution in the host and PAL for execution in 
PEPE. FORTRAN source text was passed without 
error checking except to ensure that statement 
label references (GOTO, DO, etc.) did not conflict
with PFOR construct context rules. The PFOR 
language translation system has been described in 
detail by Wilson [1]. Following is a brief over- 
view. The PFOR preprocessor converted PFOR source 
text to FORTRAN and PAL. FORTRAN and PAL source 
text were not modified. The PAL assembler
converted blocks of PAL code to S/360 assembly
language named data sets and generated a FORTRAN call
statement to a run time PEPE initiator routine
(PINIT) for each named data set. The intermixed
input and generated FORTRAN statements were passed
to the S/360 FORTRAN compiler and the generated
assembly language CSECT and DC pseudo operations
representing the named data sets containing the
blocks of PAL code were passed to the S/360
assembler. At run time, under control of the code
segments executing in the host, blocks of PEPE 
instructions were streamed over a selector channel 
to the ACU for execution and PEPE data were 
returned to the host via the same channel follow- 
ing host invocation of the PINIT interface routine. 
The stream of instructions received by the ACU
was decoded and broadcast one by one for
simultaneous execution in the ensemble Arithmetic Units
(AUs). The PFOR translation system processed only
PEPE code destined for ACU execution. A separate
Correlation Unit Assembly Language (CUAL),
implemented as S/360 assembler language macros, was used
to generate blocks of PEPE code (as S/360 named
data sets) for transmission over another selector
channel to the CCU.


Current Implementation Design 


The PFOR language translation system for the 
PEPE MSI model is designed to execute on the 
CDC 7600 under the control of SCOPE 2.0. A common 
source language for the three PEPE control units 
is accepted, and the source text is converted to 
a triple object language, namely, unique parallel 
and sequential code for each target controller 
(ACU, CCU, and AOCU). 


The PFOR compiler consists of two passes 
operating under the control of the PFOR monitor. 
The monitor performs control statement cracking 
to determine, for example, if the program unit (or 
batch of program units) is destined for execution 
in the ACU, CCU, or AOCU. Pass 1 of the compiler
contains the first pass of a conventional two-pass
assembler. Pass 1 also performs syntax analysis
and grammar checking on the source text. A source
listing, error diagnostics and a PEPE memory map
are output to a list file. A file containing an
encoded pseudo binary unit record for each
assembly language statement (PAL source input or
generated PAL) and a dictionary or symbol table
file are prepared by Pass 1 for input to Pass 2,
the assembly pass. The assembler generates an
object listing and a cross reference listing of
symbol utilization. A relocatable binary object
module is generated which can be placed in a
library file or directly input to the Process
Consolidator for linkage and binding into a core
image absolute binary load module suitable for
loading (from the CDC 7600) and execution in PEPE.


ACU-executing code sequences embedded in a
CCU or AOCU targeted source program (parent
program) are placed on disk by Pass 1. Parallel
variable entries in the symbol table are saved in
compiler memory. When the compilation of the
parent program is complete, each subordinate code
sequence destined for ACU execution is processed
as a separate, unique compilation utilizing the
saved symbol table containing entries for parallel
variables which reside in element memory. In PEPE
a single element memory is shared by the three
control units so parallel variables can be
declared and referenced in both parent programs and
subordinate code sequences. An overview of the
PFOR Translation System is depicted in Figure 9.


[Fig. 9. PFOR Translation System. The diagram is not recoverable from the source; the only surviving label is SUBORDINATE ACU PROGRAM.]


Summary and Conclusions


The PEPE MSI model PFOR compiler accepts a 
single source language and generates a triple 
object language. Language forms have been 
expanded to provide more flexibility to the 
tactical applications programmer. Using PEPE, an 
increase in data volume (as opposed to an increase 
in the complexity or sophistication of data 
manipulation) can be straightforwardly handled by 
increasing the number of elements in the ensemble. 
The PEPE software will still perform effectively 
with no changes; i.e., an increase in system 
capability obtained by adding more hardware is not 
necessarily accompanied by software breakage 
problems. This latter point is illustrated by the 
fact that in the PEPE IC model, tactical software 
for tracking targets was checked out using a 16- 
element simulator, run on the 16-element IC model 
hardware, and then transferred to a 100-element 
simulated ensemble with no changes.


PROCESS-CONSTRUCTION FOR A PARALLEL-SEQUENTIAL COMPUTER ARCHITECTURE [1]


Summary


The purpose of process construction is to
facilitate the transition from process design to
operating process. Five successive stages
comprise this transition: modification, translation,
compilation, consolidation, and operation.


The PEPE process constructor currently 
excludes the modification stage; it is performed 
by a commercial utility routine. Statements of 
design and implementation are updated and sorted 
to produce a file of process definitions and a 
file of source statements for the components of 
the object process. These are inputs to the 
translation stage. 


Definitions are translated first to produce 
object statements defining the process data base 
and to provide information for the translation 
routines to use in handling the operative state- 
ments. The latter are translated as they are 
detected in the ensuing examination of the source 
file. Unrecognizable statements in that file are 
passed in proper sequence to either the PFOR [2] 
task file or the FORTRAN task file. This stage 
also produces a control file for the consolida- 
tion stage. 


The process constructor invokes each of the 
compilers to produce a file of object modules; 
one for the PEPE and one for the host. The 
consolidation stage reads these and other files, 
such as the system subroutine libraries.
Directives to the consolidator from the translator and
information included in the object modules enable
this stage to create modules to be loaded into
each memory of the PEPE-host configuration, after
which operation begins.


The approach to tactical software develop- 
ment in the PEPE program is one of evolution from 
process design and functional simulation to live 
operation. As an aid to this approach the con- 
structor uses the Software Development Language 
(SDL). This language meets the requirements of
flexibility and ease of use through its syntax:
keyword followed by parameter. Statements are
formed from sequences of keyword-parameter sets. 
Three field delimiters each have the same meaning 
and are used interchangeably for readability. 

The simple, rigorous syntax enables the
table-driven SDL translator to be highly generalized;
thus the language is open-ended, requiring only
changes or additions to the data in the
translator's control tables for modification or addition
of a statement to the language.
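
To make the idea of a table-driven, keyword-parameter translator concrete, here is a minimal Python sketch. Everything specific in it is hypothetical: the source does not give the three SDL delimiters, the statement forms, or the layout of the control tables, so the delimiter set (blank, comma, equals sign), the keywords accepted by each statement, and the names `CONTROL_TABLE` and `translate` are assumptions made purely for illustration. Only the general scheme, statements built from keyword-parameter pairs and interpreted from data tables, comes from the text.

```python
import re

# Hypothetical: three interchangeable field delimiters.
DELIMITERS = re.compile(r"[ ,=]+")

# Hypothetical control table: statement keyword -> {keyword: output field}.
CONTROL_TABLE = {
    "PARCEL": {"PARCEL": "name", "PARTITION": "partition", "BITS": "width"},
    "ELEMENT": {"ELEMENT": "name", "QUEUE": "queue", "SET": "set"},
}


def translate(statement, table=CONTROL_TABLE):
    """Parse one keyword-parameter statement using the control table.

    Returns a dict of fields, or None when the leading keyword is not in
    the table (such statements would be passed on untranslated).
    """
    fields = [f for f in DELIMITERS.split(statement.strip()) if f]
    if not fields:
        raise ValueError("empty statement")
    head = fields[0]
    if head not in table:
        return None
    if len(fields) % 2:
        raise ValueError("every keyword needs a parameter")
    spec = table[head]
    result = {"statement": head}
    for keyword, parameter in zip(fields[0::2], fields[1::2]):
        if keyword not in spec:
            raise ValueError("unknown keyword: " + keyword)
        result[spec[keyword]] = parameter
    return result
```

Adding a statement to this toy language means adding one entry to `CONTROL_TABLE`; the `translate` function itself does not change, which is the open-ended property the paragraph describes.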


A tactical process design is described in
SDL statements and such PFOR and FORTRAN
statements as are needed to manipulate data for a
functional simulation. The process is then
constructed for simulation, merging the simulation
package with the process. As the various routines 
of the process are implemented, their code is 
added directly to the source library. The code 
thus becomes part of the process and is executed 
during operation, but it does not affect the 
functional simulation. When the tactical code is 
fully implemented the process can be constructed 
for live operation. The translator recognizes 
statements that are peculiar to a simulation and 
removes them or, in some cases, provides a trans- 
lation more appropriate to a live process. 


The basic data entities are PARCELs (parallel 
cells grouped into PARTITIONs) in PEPE ensemble 
memories, and ELEMENTs. Both translate into 
variables and arrays of up to three dimensions, 
aside from the PARCEL's innate vector across the 
ensemble. The PARCEL also may be positioned and 
packed in bit groups smaller than word size with 
more than one parcel per word. ELEMENTs comprise 
QUEUEs in host secondary storage, accessed by a 
data manager utility. They also appear as members 
of SETs, translating into labelled common in host 
primary memory and the data memories of the 
respective PEPE Control Units. 


To date the application of process construc- 
tion methods directly to the PEPE-related segments 
of the object processes to be built is extremely 
limited. Because of this it is not clear to what 
extent such methods will aid implementation of 
parallel processes. Also there are capabilities 
now available, or soon to be available, that will 
allow the description of a process to be stated in
"neutral" terms, neither specifically sequential
nor parallel. Then, via directives to the process
constructor, the designer can alter the
distribution of data and functions over the
parallel-sequential architecture to determine the optimal
assignment of functions.


Abstract -- The mapping of an existing large
real-time application onto PEPE is discussed. A 
set of measures describing the utilization of 
hardware by software and the match of the combi- 
nation of hardware and software to the problem 
are suggested. The tools used to gather data for 
both the serial and the parallel implementation 
are described. Preliminary data is presented. 


Introduction 


SETS[1] is a computer program currently im- 
plemented on a Control Data Corporation 7600 con- 
sisting of approximately 150 modules and 120,000 
machine instructions. It is designed to execute 
in real time and to model the complete environ- 
ment external to a tactical data processor in a 
ballistic missile defense scenario. The primary 
task of SETS is to generate realistic radar re- 
turns in response to radar commands which are 
communicated through an interface with another 
computer. The salient characteristics of this 
problem are: (1) there is a large, time-varying 
data base describing the changing environment 
which must be input and maintained; (2) as many 
as 5000 radar commands per second may cross the 
interface; (3) the amount of processing required 
to generate a return is highly dependent upon the 
changing environment; and (4) average response 
times as short as 200 microseconds are required. 
This is a large real time problem characterized 
by high data rates, a dynamically changing data 
base, unpredictable computational requirements, 
and short response times. Some of these charac- 
teristics are common to other real time problems 
(e.g., Air Traffic Control and command and con- 
trol systems). This particular application can 
be considered, in general terms, as a problem 
which requires that a dynamic data base be main- 
tained and that the data base be accessed to 
respond, in real time, to questions related to 
that data base. 


In an effort to extend the capacity and 
fidelity of the simulation, parts of the simula- 
tion are being implemented on PEPE, which will 
serve as an adjunct to the CDC 7600. The result- 
ing configuration will be one in which PEPE and 
the CDC 7600 cooperate to respond to an inquiry. 


The following two sections of the paper will 
summarize the current serial implementation and 
describe the mapping of the problem onto PEPE. 
The final sections will present some preliminary 
thoughts and data describing the performance of 
each implementation.


The reader's familiarity with the PEPE
architecture and nomenclature is assumed. The
preceding papers in this session should provide
sufficient background and references. A knowledge of
the CDC 7600 architecture is needed to understand
some of the measurement data.


The Serial Implementation


Serial implementations of this problem have
been designed, executed, measured and refined for
three years [2,3]. In the current version, a
single inquiry is processed to completion before
starting work on the next. Although off-line
preprocessing is used to increase throughput, no
attempt is made to anticipate an inquiry. The
relevant parts of the data base are updated as
part of the inquiry processing.


Data Structures


The basic data structure used is the linked
list. This was chosen because of the varying
storage requirements of different scenarios and
the logical interconnections of the data.


Control Structures


The process is data driven by the presence
of new inquiries. Computations are initiated
under the control of a time ordered task list.
Input/output interrupts are transparent to the
applications code.


Input/Output Structures 


The data base is double buffered into Large
Core Memory (LCM) at approximately one second
intervals. The data is then transferred into
Small Core Memory (SCM) as needed by the
applications program. References to data are made
through a Dynamic Storage Allocation system (DSA)
[4] providing memory management that is
transparent to the user. Inquiries and responses
reside in circular buffers. The management of
these buffers and the interface between computers
are performed by special purpose interrupt
handlers and Peripheral Processor Unit (PPU) routines
transparent to the applications code.


This research was supported by the Advanced Bal- 
listic Missile Defense Agency under contract 
DAHC60-73-C-0037.


The Parallel Implementation


The parallel implementation consists basi- 
cally of two tasks: (1) responding to inquiries 
and (2) updating the data base. The division is 
in contrast to the serial implementation where 
the relevant portions of the data base are up- 
dated in response to each inquiry. These tasks 
have several subtasks. The PEPE implementation 
requires that the subtasks be assigned to the 
PEPE units so as to (1) maximize the number of 
simultaneously active instruction streams, (2) 
have access to an instruction set which is suited 
for the subtask, and (3) maximize the number of 
distinct data streams for parallel subtasks. 


Description of the Major Subtask Distribution 
Among Units 


Input of the data base is performed by the 
ACU/AU and the data is primarily stored in ele- 
ment memory. This choice was dictated by the 
need for coordination between data base input and 
data base updates. 


Inquiries are input into a circular buffer 
in the CCU data memory (SCDATA) under the control 
of the CDC 7600 or a special interface computer. 
The CCU transfers this data from SCDATA into the 
appropriate Element Memories and handles the 
associated bookkeeping. This assignment was 
based upon the estimated utilization of the units 
which indicated that the CCU would be under- 
utilized and thus available for this essentially 
serial process. 


Decoding of the inquiries, selecting rele- 
vant parts of the data base and linear data base 
updates are performed by the ACU/AU. The ACU/AU 
is assigned to this task by an interrupt from the 
CCU/CU. These functions would probably have been 
assigned to the CCU/CU if fast shift instructions 
and a fixed point multiply were available. 


The generation of responses and output func- 
tions are assigned to the AOCU/AOU. The gener- 
ation function is primarily fixed point arith- 
metic and associative computations. Outputting 
of the data requires serially moving data from 
element memory to the AOCU data memory (SODATA) 
and its subsequent transfer to the CDC 7600. 

The high speed AOCU data and program memories 
appear well suited to these functions. The lack 
of a fixed point multiply might become signifi- 
cant in the future, if the generation function 
becomes more complex. 


Data base maintenance is a parallel numeri- 
cal task which is a background ACU/AU operation. 
This task keeps the data base current enough to 
allow rapid generation of responses. 


Data Structures 


The data structures are fixed arrays and 
circular buffers. The associative properties of
PEPE and the natural partitioning derived by
assigning data to different elements within the
ensemble obviate the need for a software
equivalent of linked lists.


Control Structures


The CCU is data driven by the presence of 
inquiries in SCDATA. Their presence is detected 
by the application software. The AOCU is also 
data driven by the presence of decoded inquiries 
in element memory. The ACU participates in the 
generation of a response to an inquiry when it is 
interrupted by the CCU. In addition, the ACU 
periodically inputs data base information and up- 
dates this information as a background process. 


Comparison of the Serial and 
Parallel Implementation 


Introduction 


If both machines were executing functionally 
identical software, it would be meaningful to 
compare the execution times for representative 
sets of input data. This is not possible at the 
present time. Also, the approach yields little 
insight about the relationship of the hardware, 
the software, and the problem. We propose to 
discuss some preliminary attempts to determine, 
for each of the two machines, how well the soft- 
ware is matched to the hardware, and how well the 
combination of hardware and software is matched 
to the problem. 


In the following paragraphs we will describe 
these comparisons. We will then outline the tools 
which are available to gather the data. Finally, 
we will present the preliminary data that is 
available. 


Basis of Comparison 


In principle, the applications software may 
be dichotomized into those pieces of code which 
are performing the computations (arithmetic and 
logical) specified by the functional description 
of the problem (call it problem code) and those 
pieces performing such functions as controlling 
the flow of the computational process, accessing 
data, and maintaining data structures (support 
code). Execution times will be influenced by 
both software design and the machine architecture 
that the software runs on. The ratio of problem 
code execution time to the total execution time 
(problem code plus support code) is a measure of 
the match between the hardware and software and 
the original problem. More generally stated, 
this is a measure of the resources required by a 
set of algorithms divided by the resources of 
the problem solution in which the algorithms are 
embedded. 
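
The measure just defined can be written down directly. The sketch below is only an illustration: the function name `match_ratio` and the sample timings are hypothetical and are not measurements from either implementation.

```python
def match_ratio(problem_time, support_time):
    """Fraction of total execution time spent in problem code.

    Both arguments must use the same time unit. A value near 1.0 means
    little time is spent on control flow, data access, and data
    structure maintenance.
    """
    total = problem_time + support_time
    if total <= 0:
        raise ValueError("total execution time must be positive")
    return problem_time / total


# Hypothetical timings in arbitrary units (illustration only).
print(match_ratio(problem_time=75.0, support_time=25.0))  # 0.75
```

A support code percentage, as reported later for both machines, is simply one minus this ratio.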


Although we cannot give a precise quantifi- 
cation of this measure, some of the available 
data does provide a basis for an initial estimate. 
In the serial implementation, most of the data 
access, data transfer, and data structure mainte- 
nance functions are performed via FORTRAN sub- 
routine calls to a Dynamic Storage Allocation 
system (DSA) and thus are identifiable. There 
are two limitations to using DSA execution time 
as a measure of the support code time. First, 
many of the variables used in the arithmetic com- 
putations are singly or doubly indexed. The time 
necessary to perform the index arithmetic should 
be included in the data access time, but is not 
included in the measure of DSA. The second limi- 
tation is that some of the logic of the computa- 
tions is simplified by the data structure. This 
time, which is included in the DSA total, should 
be charged to problem code. These two omissions 
bias the answer in opposite ways and thus, for a 
zero order approximation, can be ignored. 


In our particular code--the skeletal PEPE 
SETS code--the problem code is executed in the 
ensemble and most of the support code is executed 
in the sequential control logic (SCL). The SCL 
code is controlling the process, is transferring 
data from Element Memory into the control unit 
data memories, and is performing address arith- 
metic. The parallel support code is primarily 
concerned with maintaining the Activity Stack 
and inputting inquiries. 


A commonly used measure of the match of 
software to computer architecture is the amount 
of parallelism actually achieved, compared to the 
amount of parallelism inherent in the hardware. 
(For example, a system profile obtained with a 
hardware monitor is often used for this purpose.) 
We will consider this in the context of PEPE and 
the CDC 7600 central processing unit. 


In the 7600 there is the potential for in- 
struction fetch and execution overlap, and simul- 
taneous execution of multiple instructions. The 
latter is accomplished with multiple functional 
units, most of which are segmented for pipelined 
operation. The maximum execution rate is one 
instruction each cycle. 


Three types of parallelism must be consid- 
ered for PEPE: the simultaneity of instruction 
streams, the presence of multiple, independent 
data streams, and the overlap of instruction 
fetch, routing, and execution. Many of the PEPE 
instructions execute in one or two clock periods. 
The degree to which instruction fetching, routing, 
and execution are overlapped can affect the exe- 
cution rate by 100% or more. 


Measurement tools have been developed and 
are being used to gather data describing the 
achieved parallelism on each of the computers. 


Tools 


The task of comparing the two implementa- 
tions is just beginning. The data is sparse, 
and the conclusions tentative. There is a simu- 
lation of the 7600 [5] to evaluate the serial 
implementation. A software monitor package [6] 
exists to gather timing and execution path data 
for 7600 programs. As a check on the 7600 simu- 
lator there is timing and dynamic instruction 
mix data gathered with a hardware monitor [7] 
using a CDC 6400 executing a non-real time ver- 
sion of the application program. 


The CDC 7600 simulator (SIM7600) is a pro- 
gram developed by General Research Corporation 
which simulates hardware functions of the CDC 
7600 at a clock cycle level. The SIM7600 program 
is executed as an ordinary batch job under the 
control of the operating system on either CDC 
6000 or 7000 series equipment. SIM7600 models 
the Central Processor Unit (CPU), the first level 
Peripheral Processors Unit (PPU), the Maintenance 
Control Unit (MCU), a variety of external equip- 
ment, and the connecting communication channels. 


The CDC 7600 software monitor instruments 
object code to record the sequence of entries, 
exits and execution times of selected program 
modules. Reports are then generated describing 
the module characteristics and the relationships 
between modules. 
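
The idea of instrumenting modules to record entries, exits, and execution times can be sketched in a few lines. This is not the CDC 7600 monitor package: the class `Monitor`, its `instrument` decorator, and the demonstration functions are hypothetical names chosen for illustration, and the recorded times are inclusive of any nested calls.

```python
import functools
import time


class Monitor:
    """Records entry/exit events and inclusive elapsed time per module."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.events = []   # (kind, module name, timestamp) in order of occurrence
        self.totals = {}   # module name -> [invocation count, elapsed time]

    def instrument(self, func):
        name = func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = self.clock()
            self.events.append(("entry", name, start))
            try:
                return func(*args, **kwargs)
            finally:
                end = self.clock()
                self.events.append(("exit", name, end))
                entry = self.totals.setdefault(name, [0, 0.0])
                entry[0] += 1
                entry[1] += end - start

        return wrapper


monitor = Monitor()


@monitor.instrument
def decode_inquiry(x):
    return x + 1


@monitor.instrument
def generate_response(x):
    return decode_inquiry(x) * 2


generate_response(3)
for name, (calls, elapsed) in monitor.totals.items():
    print(name, calls, f"{elapsed:.6f}")
```

The ordered event list is what allows relationships between modules (who called whom, and in what sequence) to be reported afterwards.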


The hardware monitor experiments investi- 
gated the characteristics of selected programs 
running on a CDC 6400 computer system. The goals 
of the study were to evaluate the use of hardware 
monitors for measuring the performance of real 
time computer systems, and to investigate the 
characteristic use of the CDC 6400 by the SETS 
program. 


A PEPE computer simulation has been develop- 
ed in order to aid the design and testing of pro- 
gram code, to provide insight into the operation 
and interaction of the various control and compu- 
tational elements of the system, and to establish 
preliminary timing estimates for the algorithms 
which are being developed. The simulator repre- 
sents current PEPE specifications [8] and con- 
tains all of the salient characteristics of the 
real hardware design. 


The micro-code sequences were not modeled 
for each instruction algorithm. However, register 
contents, control signal values, execution delays 
and overall timing have been faithfully observed. 
Since the time-base of the simulation clock has 
the same granularity as the clock period in the 
PEPE system (100 ns), substantial data are avail- 
able for collection and evaluation. The follow- 
ing data are currently being collected: 


Clock cycle of instruction issue to each of 
the sequential and parallel units (relative 
to the beginning of simulation). 


Element activity counts at each clock cycle. 


Total of instruction issues for each unit 
(sequential and parallel). 


Distribution of issued instructions by 
major (high order 5-bit) instruction cate- 
gory. 


Count of references to element memory for 
each unit (AOU, AU, CU). 


Count of the cycles of concurrent execution 
for parallel and sequential unit combina- 
tions (AOCU/AOU, ACU/AU, CCU/CU). 


Accumulated cycles of concurrent idle time 
for sequential and parallel unit combina- 
tions (AOCU/AOU, ACU/AU, CCU/CU). 


Calculated effective instruction rate. 


In addition, a detailed trace facility has been 
incorporated to provide register contents, con- 
trol signal status and element activity on a 
clock cycle basis. The trace facility may be 
enabled or disabled under object program control 
by utilizing one of the unused test and mainte- 
nance instructions. 


In order to facilitate writing object code 
to be measured with the simulation tool described 
above, a cross assembler was developed to trans- 
late PEPE mnemonic instruction formats to appro- 
priate bit-field definitions. The assembler 
permits data as well as instructions to be gener- 
ated for any of the global (program or data) 
memories and, in addition, permits data to be 
preset into element memory. The cross assembler 
is implemented with the COMPASS assembly language 
of the CDC 6000-7000 computer systems. It allows 
the use of all of the standard features of the 
COMPASS assembler. 


Serial Implementation Data 


Using the software monitor we measured the 
elapsed central processor time, operating system 
services time, wall clock time, and the number of 
invocations for each module and major sequence of 
modules in the SETS code. Only the central pro- 
cessor time is considered for two limiting cases 
to derive the ratio of support code time to total 
execution time. The first case was one in which 
the computations required to generate a response 
to an inquiry were minimal; in the other case 
the number of computations was maximized. The 
support code data, presented in Figure 1, is the 
sum of the time spent in the data access and data 
structure maintenance routines (primarily DSA) 
and the time spent in routines which transfer 
data within the central processor memory systems. 
The high support code values for the minimum case 
are indicative of the amount of data handling 
activities performed in processing an inquiry 
independent of the complexity of the response. 


Figure 2 presents some preliminary measure- 
ments which describe the parallelism achieved by 
the CDC 7600 CPU for a particular execution of 
the SETS program. The SETS program was in this 
case responding to a typical inquiry. This data 
was derived by running SETS with the CDC 7600 
simulation. 


The measures used are millions of instruc- 
tions issued per second (MIPS), the fraction of 
cycles waiting to issue the next instruction, and 
the ratio of execution time to the execution time 
of an equivalent serial instruction stream com- 
puted by summing the execution time of all in- 
structions issued. Instruction issue is delayed 
by contentions for registers, certain functional 
units, and the unavailability of an operand or 
instruction word being fetched from memory. 


The CDC 7600 has a maximum instruction issue 
rate of one instruction every 27.5 nanoseconds. 
On this basis the SETS code issues instructions 
at approximately one-third the maximum rate. In 
the sample code, almost 15% of the total CPU time 
was spent initiating the fetch of instruction 
words from memory. In an additional 47% of the 
machine cycles no instruction issue occurred due 
to the delays mentioned in the preceding para- 
graph. Although it appears that much of this 
delay time is spent waiting for operands to be 
fetched from memory, more work is required to 
fully understand the mechanisms. It is expected 
that the CDC 7600 simulator will supply the data 
necessary to illuminate the causes of the delays. 
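
The figures quoted above can be cross-checked with a little arithmetic. The sketch below assumes, as a simplification that the paper does not state, that every cycle not spent waiting to issue or initiating an instruction word fetch issues exactly one instruction. The constant and function names are illustrative.

```python
CYCLE_NS = 27.5           # minimum time between instruction issues
WAITING_FRACTION = 0.475  # cycles waiting to issue (Figure 2)
FETCH_FRACTION = 0.148    # cycles initiating an instruction word fetch (Figure 2)


def peak_mips(cycle_ns):
    """Peak issue rate in millions of instructions per second."""
    return 1e3 / cycle_ns  # (1e9 ns per second / cycle_ns) / 1e6


def issue_fraction(waiting, fetch):
    """Fraction of cycles left for issuing, under the assumption above."""
    return 1.0 - waiting - fetch


peak = peak_mips(CYCLE_NS)
fraction = issue_fraction(WAITING_FRACTION, FETCH_FRACTION)
achieved = peak * fraction

print(f"peak rate       : {peak:.1f} MIPS")      # 36.4
print(f"issuing fraction: {fraction:.3f}")       # 0.377
print(f"achieved rate   : {achieved:.1f} MIPS")  # 13.7
```

Under this assumption the issuing fraction of about 0.38 agrees with both the "approximately one-third" statement and the MIPS value reported in Figure 2.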


Parallel Implementation Data 


In the serial implementation the majority of 
the code is concerned with small pieces of logic 
arising from the need to consider a wide range of 
input scenarios. This diversity is lacking in the 
current PEPE code. We believe that these omis- 
sions cause the data base update to have more 
simultaneous data streams (active elements) and 
simpler data base update algorithms than would 
exist in the complete code. The complete code 
will probably require some additional associative 
operations to select the relevant portions of the 
data base for a given inquiry. In addition, the 
complete code will probably include a background 
process in the CCU/CU as part of the data base 
maintenance process. It is our guarded belief 
that the skeletal version does accurately repre- 
sent the degree of interaction between the units. 


The support code/problem code data for one 
case was derived from the instruction trace out- 
put of the PEPE simulator. The input for this 
example consisted of only one inquiry. The re- 
sult is that the AOCU and CCU spend most of the 
time waiting for the arrival of data. In the 
full PEPE SETS there would be background tasks 
assigned to these units as well as a steady flow 
of inquiries. The effect of the limited inquiry 
data is to bias the execution towards a high per- 
centage of support code. The support code con- 
sumed 44.8% of the total execution time. 


Figure 3 presents a summary of the execution 
characteristics of a 100 microsecond time inter- 
val which included the complete processing of an 
inquiry. Polling for new inquiries and data base 
maintenance were background processes. The effec- 
tive instruction rate is the sum of the average 
number of instructions issued each second to the 
three bodies of Sequential Control Logic (SCL) 
and the three Parallel Instruction Control Units 
(PICU). Thus any backlog remaining in the Paral- 
lel Instruction Queue (PIQ) at the termination of 
a run is not included in the total. 
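
The effective instruction rate can be reproduced from the issue counts in Figure 3. The sketch below uses the 100 ns clock period and the 1000-period measurement interval given in the text; the function and variable names are illustrative.

```python
CLOCK_PERIOD_NS = 100    # PEPE clock period
MEASURED_PERIODS = 1000  # length of the measurement interval (Figure 3)

# Instructions issued to each unit during the interval (Figure 3).
issues = {"AOCU": 227, "AOU": 280, "ACU": 86, "AU": 194, "CCU": 437, "CU": 14}


def effective_rate_millions(issue_counts, periods, period_ns):
    """Total issues per second, in millions, over the measured interval."""
    total_issues = sum(issue_counts.values())
    interval_ns = periods * period_ns
    return total_issues * 1e3 / interval_ns  # issues per ns, scaled to 1e6 per s


rate = effective_rate_millions(issues, MEASURED_PERIODS, CLOCK_PERIOD_NS)
print(f"{sum(issues.values())} issues in "
      f"{MEASURED_PERIODS * CLOCK_PERIOD_NS / 1000:.0f} microseconds")
print(f"effective rate: {rate:.2f} x 10^6 issues per second")  # 12.38
```

The result matches the 12.38 figure reported in Figure 3, which confirms that the rate counts issues to all six units without weighting by the number of active elements.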


Regarding instruction issues as a measure of 
execution rate bypasses the problem of scaling 


              SERIAL IMPLEMENTATION 
               SUPPORT CODE TIMING 
        (Percent of Total CPU Execution) 

                      Minimum        Maximum 
                      Response       Response 
                      Computation    Computation 

Data Access and 
Structuring (DSA)       17.2            5.5 

Data Transfer           17.5            7.1 

Total                   34.7           12.6 

                    Figure 1. 


            PRELIMINARY MEASUREMENTS 
           OF CDC 7600 CPU PARALLELISM 

% Cycles Waiting to Issue                  47.5% 

% Cycles Used Initiating the 
Fetch of a New Instruction Word            14.8% 

MIPS                                       13.7 

Elapsed Time / 
Equivalent Serial Execution Time           0.59 

                    Figure 2. 


             PEPE EXECUTION SUMMARY 
       (Measurement = 1000 Clock Periods) 

NUMBER OF INSTRUCTIONS ISSUED 

   AOCU    AOU    ACU     AU    CCU     CU 
    227    280     86    194    437     14 

NUMBER OF REFERENCES TO ELEMENT MEMORY 

    AOU     AU     CU 
     64    132      6 

EFFECTIVE INSTRUCTION RATE 
(Issues x 10^6 per second) 

   12.38 

                    Figure 3. 


        PEPE INSTRUCTION STREAM ANALYSIS 

INDIVIDUAL UNIT ACTIVITY 
(Percent of Total Time) 

   AOCU    AOU    ACU     AU    CCU     CU 
   38.1   49.5    9.5   65.6   50.3    2.9 

OVERLAPPED UNIT ACTIVITY 
(Percent of Total Time) 

   AOCU/AOU    ACU/AU    CCU/CU 
      5.3        5.8       0.7 

OVERLAPPED UNIT IDLE PERIODS 
(Percent of Total Time) 

   AOCU/AOU    ACU/AU    CCU/CU 
     17.7       30.7      47.5 

                    Figure 4. 


the parallel instruction execution rates by the 
number of active elements to derive a MIPS esti- 
mate. We chose this approach for two reasons. 
First, the utility and accuracy of the above 
MIPS estimate are questionable. Second, the num- 
ber of active elements is often a function of the 
input data. We are attempting to consider the 
relationship between the hardware and software 
independent of the particular set of input data 
whenever possible. 


The data in Figure 4 is part of the summary 
output from the PEPE simulation for the example 
considered in the preceding paragraph. The indi- 
vidual unit activity represents the percentage of 
time that each unit was executing instructions. 
(For the SCL the S_SCLF flags were monitored to 
determine activity for each cycle. For the PICU 
the S_IREQ flag was used.) The overlapped acti- 
vity is the percentage of time that a sequential 
unit and its associated parallel unit were both 
active. The overlapped idle time is the percen- 
tage of time that a sequential and parallel unit 
were simultaneously not executing instructions. 
We present a preliminary interpretation of some 
of these values in order to give some information 
concerning the operation of PEPE and to show the 
types of analysis which can be performed using 
the simulation output. The interpretations are 
based upon the summary output and upon the cycle- 
by-cycle instruction trace, which is not shown. 
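
The three groups of percentages in Figure 4 are not independent. For each sequential/parallel pair, the time at least one unit is active plus the time both are idle must account for the whole interval. The sketch below checks this identity against the Figure 4 values; the dictionary layout and the function name `residual` are illustrative.

```python
# Values from Figure 4, in percent of total time.
activity = {"AOCU": 38.1, "AOU": 49.5, "ACU": 9.5,
            "AU": 65.6, "CCU": 50.3, "CU": 2.9}
both_active = {("AOCU", "AOU"): 5.3, ("ACU", "AU"): 5.8, ("CCU", "CU"): 0.7}
both_idle = {("AOCU", "AOU"): 17.7, ("ACU", "AU"): 30.7, ("CCU", "CU"): 47.5}


def residual(sequential, parallel):
    """Departure from 100% of (either unit active) + (both units idle)."""
    pair = (sequential, parallel)
    either_active = activity[sequential] + activity[parallel] - both_active[pair]
    return either_active + both_idle[pair] - 100.0


for seq, par in both_active:
    print(f"{seq}/{par}: residual {residual(seq, par):+.2f}%")
```

All three residuals vanish to within rounding, so any two of the three measures for a pair determine the third.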


Individual Activity 


The low utilization of the ACU and the CUs 
is due to the few instructions in the skeletal 
code which execute there. The high utilization 
of the AUs is due in part to the relatively long 
execution times of many of the AU instructions. 
It is also the result of the lack of delay be- 
tween successive AU instructions. This in turn 
is due to the PIQ which mitigates the effects of 
the relatively slow (300 ns) ACU program memory 
(SAPRGM). 


Overlapped Activity 


The low overlap of the AOCU/AOU activity, 
in spite of the many instructions executing 
there, is due to the short instruction execution 
times and the absence of a PIQ. However, since 
the AOCU/AOU code sequences tend to be short and 
OTA instructions (output from an element 
A-register to the sequential A-register) are 
frequent, a PIQ would probably provide little 
additional throughput. The relatively high over- 
lap between the ACU and the AUs is due to the 
PIQ. The CCU/CU overlap is low because of the 
inactivity of the CUs and the factors considered 
for the AOCU/AOU. 


Overlapped Idle 


Since the CUs are idle most of the time, the 
CCU/CU idle time is primarily due to the non- 
overlap of instruction fetch with SCL instruction 
execution. This effect is less prominent in the 
AOCU/AOU due to the greater percentage of 
parallel instructions and thus increased overlap 
of fetch and execution. 


Space limitations prevent the inclusion of 
a copy of the dynamic instruction trace or full 
execution trace. However, the instruction trace 
is useful for determining the occurrence and dura- 
tion of execution delays such as those caused by 
non-overlapped routing and execution, waiting for 
the PIQ to empty before OTA or branch instruc- 
tions, or waiting for interrupts to be completed 
or accepted. (In the current code all of these 
delays tended to be of short duration, no more 
than 3 cycles. In test cases, however, delays 
in excess of 10 cycles have been observed.) 
Periods of inactivity for a particular unit and 
the effects of code segment rearrangement are 
easily seen on the trace. The inclusion in the 
trace of instruction issues to the PIQ, as well 
as to the PICU and SCL, provides insight into 
PIQ/SCL dynamics. The full execution trace has 
proved a valuable aid in debugging the PEPE code. 


Summary 


Continued work is planned in several areas. 
The PEPE SETS code will be expanded. At the same 
time the PEPE simulation will be enhanced by the 
addition of models for the interfaces between 
Input/Output Units (IOU) and external computers. 
We will continue to measure the characteristics 
of the PEPE code and develop methods for compar- 
ing them with those of the CDC 7600 implementa- 
tion of SETS. The measurements will also be ex- 
tended to consider the effects of input/output 
on the performance of both versions of SETS. 


COMPUTER SIMULATION OF PEPE AND ITS HOST AT THE INSTRUCTION LEVEL 


James L. Troy 
Huntsville Operations 
System Development Corporation 
Huntsville, Alabama 35805 


Summary 


A serious problem emerges when an attempt is 
made to simulate parallel processing at the 
instruction level: execution time may be imprac- 
tically slow. The faster a parallel processor 
executes, the slower will its instruction level 
simulator execute, and PEPE is very fast. The 
first attempt to simulate a 100-element PEPE at 
the instruction level (on an IBM 360/65, executing 
a BMD problem) produced a snail's-pace real-to- 
simulated time expansion of 11,000 to 1. This 
ratio was quickly reduced to a more practical 1000 
to 1 by reducing the size of simulated element 
memory to fit the available core space. But be- 
cause of the increased complexity and power of the 
current larger scale PEPE, new solutions to the 
problem of excessive time expansion were 
sought.[1] Adding to the problem, however, were 
new requirements: the executions of all three 
control units were to be simulated "simultaneous- 
ly" to accurately measure inter-unit memory access 
conflicts; element expandability (from 36 to 800 
elements) was to be provided with particular 
emphasis on the efficient simulation of a 288- 
element PEPE; and instruction time was to be accu- 
rately modeled. A CDC 7600 was selected to 
execute this simulation since it is also being 
used as PEPE's "Host" to execute sequentially- 
oriented system tasks. Its relatively fast execu- 
tion speed and large core storage are helpful in 
alleviating some of the problems inherent with 
sequential machines simulating PEPE; i.e., the use 
of such machines requires looping through many 
arrays which represent element data. The 7600, 
however, requires time-consuming data conversions 
between its 60-bit 1's complement and PEPE's 
32-bit 2's complement formats. 
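
The conversion cost mentioned above comes from the two machines representing negative numbers differently. The following is a minimal sketch for integers only, written in Python rather than the simulator's FORTRAN; the function names are hypothetical and floating point formats are not covered.

```python
MASK32 = (1 << 32) - 1
MASK60 = (1 << 60) - 1


def pepe_to_int(word32):
    """Interpret a 32-bit pattern as a two's complement integer."""
    word32 &= MASK32
    return word32 - (1 << 32) if word32 & (1 << 31) else word32


def int_to_pepe(value):
    """Encode an integer as a 32-bit two's complement pattern."""
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("value does not fit in 32 bits")
    return value & MASK32


def int_to_cdc(value):
    """Encode an integer as a 60-bit ones' complement pattern."""
    if abs(value) >= (1 << 59):
        raise ValueError("value does not fit in 60 bits")
    # A negative number is the bitwise complement of its magnitude.
    return value if value >= 0 else (~(-value)) & MASK60


def cdc_to_int(word60):
    """Interpret a 60-bit pattern as a ones' complement integer."""
    word60 &= MASK60
    if word60 & (1 << 59):
        return -((~word60) & MASK60)  # all ones ("negative zero") maps to 0
    return word60


pattern = 0xFFFFFFFF  # -1 in 32-bit two's complement
host_word = int_to_cdc(pepe_to_int(pattern))
print(oct(host_word))                             # 0o77777777777777777776
print(hex(int_to_pepe(cdc_to_int(host_word))))    # 0xffffffff
```

Note that ones' complement has two encodings of zero, so a converter has to decide how to treat the all-ones word; this sketch maps it to ordinary zero.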


The main approach taken to reduce execution 
time has been to eliminate code. There are, for 
instance, very few error checks. Erroneous condi- 
tions (such as, in PEPE, arithmetic operations 
with unnormalized floating point numbers) are not 
simulated where these conditions would surely lead 
to a program abort anyway. A preprocessing scheme 
eliminates several thousand word tables that would 
have otherwise been required in the online envir- 
onment to provide instruction routing, execution 
times, legal field combinations and other informa- 
tion. This core space savings translates into 
considerable time saved due to Large Core Memory 
access-time characteristics of the CDC-7600. 
Dynamic instruction modification is disallowed by 
the software, so some instruction execution tasks 
can be preprocessed. Data, such as illegal 
instruction flag, execution time, address field 
size (which varies) and traps for parameter test- 
ing, are stored during a preprocessing pass into 
the 28 remaining bits of the 60-bit CDC 7600 word 
reserved in the load module for each 32-bit PEPE 
instruction. 
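
The packing of preprocessed data alongside each instruction can be sketched as follows. Only the 32/28 bit split comes from the text. The field order, the individual field widths (1, 8, 5, and 14 bits), and all names below are assumptions made for illustration.

```python
from typing import NamedTuple


class Meta(NamedTuple):
    """Preprocessed data stored beside an instruction (hypothetical layout)."""
    illegal: int    # 1 bit: illegal instruction flag
    exec_time: int  # 8 bits: execution time in clock periods
    addr_size: int  # 5 bits: address field size
    traps: int      # 14 bits: parameter-testing trap bits


WIDTHS = (1, 8, 5, 14)  # sums to the 28 spare bits of a 60-bit word


def pack(instr32, meta):
    """Place the instruction in the low 32 bits and the metadata above it."""
    if not 0 <= instr32 < (1 << 32):
        raise ValueError("instruction must fit in 32 bits")
    packed = 0
    for value, width in zip(meta, WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError("metadata field out of range")
        packed = (packed << width) | value
    return (packed << 32) | instr32


def unpack(word60):
    """Recover the instruction and metadata using shifts and masks."""
    instr32 = word60 & 0xFFFFFFFF
    packed = word60 >> 32
    fields = []
    for width in reversed(WIDTHS):
        fields.append(packed & ((1 << width) - 1))
        packed >>= width
    return instr32, Meta(*reversed(fields))


word = pack(0xDEADBEEF, Meta(illegal=0, exec_time=3, addr_size=16, traps=0x2A))
print(word < (1 << 60))  # True
print(unpack(word))
```

With such a layout the simulator's inner loop can obtain the execution time or the illegal flag with one shift and mask instead of a table lookup, which is the saving the preprocessing pass is meant to provide.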


FORTRAN was chosen as the programming lan- 
guage, though code generation has been monitored 
closely to avoid inefficient object code. 
Extended FORTRAN for the CDC 7600 provides the 
necessary shift and mask statements to manage 
packed data. It was felt that the code generated 
by a modern compiler is efficient enough and the 
programming time thus saved is better spent inter- 
preting the complexities of parallel hardware. 


The PEPE simulator is instruction-driven and 
time is incremented following the occurrence of 
events which affect time. When PEPE simulation 
is interrupted due to I/O or interrupts between 
PEPE and external equipment, control is temporar- 
ily returned to a simulation control program 
(SDC's PEPSIE) which is event/time driven and in 
charge of coordinating I/O transactions between 
PEPE and the outside world. 


The element expandability requirement pro- 
duces a data variance of a million words, far too 
varied for one all-encompassing FORTRAN data 
block. So, three simulator versions are being 
produced in which up to 36, 300 or 800 elements 
can be modeled. If fewer elements are desired the 
space required for the maximum is blocked but not 
used. The 800-element version contains a disc- 
paging algorithm in which a block of contiguous 
addresses of element memory (for all elements) is 
maintained in core. This method was chosen based 
upon tests which showed that subsequent element 
memory accesses tend to stay in one "neighborhood" 
of memory for relatively long durations. The 
36-element simulator is expected to reside in 7600 
core at all times along with "Host" programs, the 
simulation controller, and executive programs. 
Disc transfer is expected to be required for the 
300-element configuration only between the 
execution of major program segments. 
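
The paging idea for the 800-element version can be illustrated with a small model. This is a sketch under stated assumptions, not the simulator's algorithm: the class name, the fixed window alignment, and the write-back policy are all hypothetical, and a Python list stands in for the disc file.

```python
class PagedElementMemory:
    """Keeps one window of contiguous addresses, for all elements, in core."""

    def __init__(self, n_elements, words_per_element, window_words):
        self._words = words_per_element
        self._window = window_words
        # Stand-in for the disc copy of every element's memory.
        self._disc = [[0] * words_per_element for _ in range(n_elements)]
        self._base = 0
        self._core = [row[:window_words] for row in self._disc]
        self.page_faults = 0

    def _ensure(self, address):
        if not 0 <= address < self._words:
            raise IndexError("element memory address out of range")
        if self._base <= address < self._base + self._window:
            return  # address is in the resident neighborhood
        # Write the resident window back, then load the one holding address.
        for row, cached in zip(self._disc, self._core):
            row[self._base:self._base + len(cached)] = cached
        self._base = (address // self._window) * self._window
        self._core = [row[self._base:self._base + self._window]
                      for row in self._disc]
        self.page_faults += 1

    def read(self, element, address):
        self._ensure(address)
        return self._core[element][address - self._base]

    def write(self, element, address, value):
        self._ensure(address)
        self._core[element][address - self._base] = value


memory = PagedElementMemory(n_elements=4, words_per_element=64, window_words=16)
memory.write(2, 5, 99)     # inside the initial window, no transfer
memory.write(1, 40, 7)     # outside the window, forces a transfer
print(memory.read(1, 41))  # 0, same neighborhood, no transfer
print(memory.page_faults)  # 1
```

Because the window covers the same address range in every element, a parallel instruction that touches one address in all elements never needs more than one transfer, which fits the observed tendency of accesses to stay in one neighborhood.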


Through the use of these techniques the PEPE 
instruction level simulator is expected to be a 
valuable tool in checking out the software utility 
package for the MSI Model PEPE currently being 
constructed and to validate the hardware design of 
the current model or of future large-scale 
integration PEPE models. 